Polish dataset download

Dear all,
Is there a possibility to download a dataset that’s not completed yet?

Thank you


Hello,

What do you mean by “not completed”? We currently release the dataset at fixed intervals, and we are working on releasing it more often.

Cheers.

Hello Ruben,

Can you tell me then how to download the Polish dataset?

Best Regards

Polish has 64 hours validated but they were collected in early 2020.

Our last dataset release only includes data collected until December 10th.

The Polish dataset will be included in our next dataset release (we don’t have a date yet, we are working on that).

Thank you for the information, we are working hard on contributions to complete our dataset :wink:

Best Regards


Please note that Polish only has 8400 sentences, so current voice contributions are repetitions, which we should avoid for a higher-quality dataset.
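To make the repetition point concrete, here is a rough back-of-the-envelope sketch using the two numbers mentioned in this thread (64 validated hours, 8400 sentences). The average clip length of 5 seconds is purely an assumption for illustration, not a figure from the project.

```python
VALIDATED_HOURS = 64       # from this thread
SENTENCE_COUNT = 8400      # from this thread
ASSUMED_CLIP_SECONDS = 5   # assumption: average clip length, for illustration only


def audio_seconds_per_sentence(hours, sentences):
    """Average amount of validated audio available per unique sentence."""
    return hours * 3600 / sentences


def estimated_recordings_per_sentence(hours, sentences, clip_seconds):
    """Rough number of recordings per sentence, given an assumed clip length."""
    return audio_seconds_per_sentence(hours, sentences) / clip_seconds


if __name__ == "__main__":
    per_sentence = audio_seconds_per_sentence(VALIDATED_HOURS, SENTENCE_COUNT)
    repeats = estimated_recordings_per_sentence(
        VALIDATED_HOURS, SENTENCE_COUNT, ASSUMED_CLIP_SECONDS
    )
    print(f"{per_sentence:.1f} seconds of validated audio per sentence")
    print(f"roughly {repeats:.1f} recordings per sentence (under the assumed clip length)")
```

Whatever the real clip length is, about 27 seconds of audio per sentence means most sentences have already been read several times, which is why more sentences are needed before more recordings.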

I recommend checking this topic and sharing it with technical contributors so we can import a large number of sentences for Polish and have room for more voice recordings:

Thanks for your contributions!

OK then, we will prepare a batch from Polish Wikipedia, according to the instructions you provided. Can you tell me where I can find the list and number of Polish sentences?

BR

You can see some stats if you add Polish as your language on the sentence collector:

https://common-voice.github.io/sentence-collector/#/

Raw data is stored here:

Thanks!

How often do you update the raw data on GitHub?

Does “validated sentences” mean that they have been recorded and approved within contributions on ?

This is usually exported weekly from the sentence collector, plus any additional source imports (like the Wikipedia process, which I recommend instead of manually collecting sentences).

Sentences need to be added and positively reviewed by two additional contributors to be “validated”.
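As a minimal sketch of the rule just described, assuming each review is recorded as a simple yes/no vote (the function name and data shape are hypothetical, and how negative reviews are handled is not covered in this thread):

```python
def is_validated(review_votes, required_approvals=2):
    """Return True once enough reviewers have approved a sentence.

    review_votes: list of booleans from reviewers other than the submitter,
    where True means a positive review. This only models the "two positive
    reviews" rule; rejection handling is deliberately left out.
    """
    approvals = sum(1 for vote in review_votes if vote)
    return approvals >= required_approvals


if __name__ == "__main__":
    print(is_validated([True]))               # one approval is not enough
    print(is_validated([True, False, True]))  # two approvals
```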

Thanks for your quick answer. At the moment we are preparing a large batch for the collector from Polish Wikipedia. See you soon, then.

BR

I don’t know if I understood this correctly, but I want to note that Wikipedia extraction is a separate process whose output can’t be added to the sentence collector. As explained in the topic I linked, it has its own process.

Just making sure we are on the same page :slight_smile:

Thanks again!

I know that. At the moment we are scraping Polish Wikipedia and filtering sentences via wordlists and all the other things that you provided in the technical article. After that we will prepare it to be added to the sentence collector.

BR

Great news @aiteam, I wanted to do it myself. I am willing to review those sentences (as much as I can, of course). Please just ping me and others when you finish. :slightly_smiling_face:

I’ll reinforce my previous message: Wikipedia-extracted sentences are never imported into the sentence collector, so please don’t do this.

The process is to contribute to the rulesets and blacklist:

Then the team will run the extraction and incorporate the results into the Common Voice repo.
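To give a rough idea of what a ruleset and blacklist do during extraction, here is a small sketch. It is not the actual extractor or its rule file format; the rule names, character set, limits, and blacklist words below are all hypothetical and only illustrate the kind of filtering involved.

```python
import re

# Hypothetical rules for illustration only; the real ruleset format differs.
RULES = {
    "min_words": 3,
    "max_words": 14,
    # Only Latin and Polish letters, spaces, and basic punctuation (no digits).
    "allowed_chars": re.compile(r"^[A-Za-zĄĆĘŁŃÓŚŹŻąćęłńóśźż ,.?!'-]+$"),
}

# Hypothetical blacklist entries; a real list would be built by Polish speakers.
BLACKLIST = {"xd", "lol"}


def is_acceptable(sentence, rules=RULES, blacklist=BLACKLIST):
    """Return True if a sentence passes the length, character, and blacklist checks."""
    words = sentence.split()
    if not rules["min_words"] <= len(words) <= rules["max_words"]:
        return False
    if not rules["allowed_chars"].match(sentence):
        return False
    # Compare words case-insensitively and without surrounding punctuation.
    normalized = {word.strip(",.?!'-").lower() for word in words}
    return normalized.isdisjoint(blacklist)


if __name__ == "__main__":
    candidates = [
        "To jest dobre zdanie.",
        "Rok 1999 był długi.",
        "On napisał lol wczoraj.",
    ]
    for sentence in candidates:
        print(sentence, is_acceptable(sentence))
```

The useful contribution from Polish speakers is deciding what the limits, allowed characters, and blacklisted words should be, since those choices depend on the language.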

Thanks!

Ahh, sh.t, I didn’t understand your idea. Do you already have rulesets and blacklists for Polish Wikipedia scraping?

It seems that at the beginning I didn’t understand the model for wiki scraping and the collector :wink: I’ll try to get a net source of Polish datasets to review.

No, we don’t. We need technical Polish-speaking contributors to help :slight_smile:

:slight_smile: If you want me to prepare the rules and blacklist, just let me know.


Let’s do it together; see here: https://discourse.mozilla.org/t/coordination-of-input-for-polish-language-wiki-scrapper/53380