Polish dataset download

aiteam · January 25, 2020, 10:07am

Dear all,
is there a possibility to download a dataset that’s not completed yet?

Thank You

nukeador · February 3, 2020, 1:16pm

Hello,

What do you mean by “not completed”? We currently release the dataset at fixed intervals, and we are working on releasing it more often.

Cheers.

aiteam · February 3, 2020, 3:19pm

Hello Ruben,

Can you tell me then how to download the Polish dataset?

Best Regards

nukeador · February 3, 2020, 4:13pm

Polish has 64 hours validated but they were collected in early 2020.

Our last dataset release only includes data collected until December 10th.

The Polish dataset will be included in our next dataset release (we don’t have a date yet; we are working on that).

aiteam · February 3, 2020, 5:12pm

Thank you for the information. We are working hard on contributions to complete our dataset.

Best Regards

nukeador · February 3, 2020, 9:53pm

Please note that Polish only has 8400 sentences, so current voice contributions are repetitions (which we should avoid for a higher-quality dataset).

I recommend checking this topic and circulating it among technical contributors so we can import a large number of sentences for Polish and have room for more voice recordings:

Thanks for your contributions!

aiteam · February 4, 2020, 11:10am

OK then, we will prepare a batch from Polish Wikipedia, following the instructions you provided. Can you tell me where I can find the list and number of Polish sentences?

BR

nukeador · February 4, 2020, 11:11am

You can see some stats if you add Polish as your language on the sentence collector:

https://common-voice.github.io/sentence-collector/#/

Raw data is stored here:

aiteam · February 4, 2020, 12:26pm

Thanks!

How often do you update the raw data on GitHub?

Does “validated sentences” mean that they are recorded and approved within contributions on ?

nukeador · February 4, 2020, 12:29pm

This is usually exported weekly from the sentence collector, plus any additional source imports (like the Wikipedia process, which I recommend instead of manually collecting sentences).

Sentences need to be added and positively reviewed by two additional contributors to be “validated”.
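
For readers who prefer code, the rule above can be sketched roughly as follows. This is only an illustration of the "two additional positive reviews" idea, not the sentence collector's actual implementation; the `Sentence` class, its fields, and the reviewer names are all hypothetical.

```python
from dataclasses import dataclass, field

# Assumption taken from the post above: two positive reviews are required,
# and they must come from contributors other than the one who added the sentence.
REQUIRED_APPROVALS = 2


@dataclass
class Sentence:
    """Hypothetical record for a submitted sentence (not the real data model)."""
    text: str
    author: str
    approvals: set = field(default_factory=set)

    def approve(self, reviewer: str) -> None:
        # The author is not an "additional" contributor, so their own
        # approval is ignored. A set keeps repeat approvals from counting twice.
        if reviewer != self.author:
            self.approvals.add(reviewer)

    def is_validated(self) -> bool:
        return len(self.approvals) >= REQUIRED_APPROVALS


sentence = Sentence("To jest przykładowe zdanie.", author="author_a")
sentence.approve("author_a")      # ignored: same person who added it
sentence.approve("reviewer_b")
print(sentence.is_validated())    # False
sentence.approve("reviewer_c")
print(sentence.is_validated())    # True
```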

aiteam · February 4, 2020, 12:44pm

Thanks for your quick answer. At the moment we are preparing a large batch for the collector from Polish Wikipedia. See you soon then.

BR

nukeador · February 4, 2020, 12:48pm

I don’t know if I understood this correctly, but I want to note that Wikipedia extraction is a separate process and its output can’t be added to the sentence collector. As explained in the topic I linked, it has its own process.

Just making sure we are on the same page.

Thanks again!

aiteam · February 4, 2020, 1:08pm

I know that. At the moment we are scraping Polish Wikipedia and filtering sentences via wordlists and all the other things you described in the technical article. After that we will prepare it to be added to the sentence collector.

BR

madziszyn · February 4, 2020, 1:39pm

Great news @aiteam, I wanted to do it myself. I am willing to review those sentences (as many as I can, of course). Please just ping me and the others when you finish.

nukeador · February 4, 2020, 1:55pm

I’ll reinforce my previous message: sentences extracted from Wikipedia are never imported into the sentence collector. Please don’t do this.

The process is to contribute to the rulesets and blacklist:

Then the team will run the extraction and incorporate the result into the Common Voice repo.

Thanks!
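
To give contributors a feel for what a blacklist and a few rules do during extraction, here is a minimal sketch. It is not the project's actual extractor and does not use its rule file format; the function names, the one-word-per-line blacklist layout, and the word-count limits are assumptions made purely for illustration.

```python
import re


def load_blacklist(lines):
    """Normalize blacklist entries (assumed one word per line) to lowercase."""
    return {line.strip().lower() for line in lines if line.strip()}


def is_allowed(sentence, blacklist, min_words=3, max_words=14):
    """Return True if the sentence passes these hypothetical rules."""
    # \w is Unicode-aware for str patterns, so Polish diacritics are kept.
    words = re.findall(r"\w+", sentence.lower())
    # Rule 1 (assumed limits): reject sentences that are too short or too long.
    if not (min_words <= len(words) <= max_words):
        return False
    # Rule 2 (assumed): reject sentences containing digits.
    if any(ch.isdigit() for ch in sentence):
        return False
    # Rule 3: reject sentences containing any blacklisted word.
    return not any(word in blacklist for word in words)


blacklist = load_blacklist(["xyz\n", "  Foo ", ""])
candidates = [
    "Ala ma kota i psa",
    "Ala ma foo kota",
    "W roku 1990 było zimno",
]
accepted = [s for s in candidates if is_allowed(s, blacklist)]
print(accepted)  # ['Ala ma kota i psa']
```

A set is used for the blacklist so that each word lookup stays cheap even when the list grows to many thousands of entries.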

aiteam · February 4, 2020, 3:01pm

Ahh, sh.t, I didn’t understand your idea. Do you already have rulesets and blacklists for scraping Polish Wikipedia?

aiteam · February 4, 2020, 3:05pm

It seems that at the beginning I didn’t understand the model for wiki scraping and the collector. I’ll try to get an online source of Polish datasets to review.

nukeador · February 4, 2020, 3:32pm

No, we don’t. We need technical Polish-speaking contributors to help.

aiteam · February 4, 2020, 8:42pm

If you want me to prepare the rules and blacklist, just let me know.

jakub.wrobel7 · February 5, 2020, 9:03pm

Let’s do it together. See here: https://discourse.mozilla.org/t/coordination-of-input-for-polish-language-wiki-scrapper/53380
