Polish dataset download

nukeador · February 4, 2020, 1:55pm

I’ll reinforce my previous message: note that sentences extracted from Wikipedia are never imported into the Sentence Collector, so please don’t do this.

The process is to contribute to the rulesets and blacklist:

Then the team will run the extraction and incorporate the result into the Common Voice repo.

Thanks!

aiteam · February 4, 2020, 3:02pm

Ahh, sh.t, I didn’t understand your idea. Do you already have
rulesets and blacklists for scraping the Polish Wikipedia?

aiteam · February 4, 2020, 3:05pm

It seems that at the beginning I didn’t understand how the wiki
scraping and the collector fit together. I’ll try to find an online
source of Polish datasets to review.

nukeador · February 4, 2020, 3:32pm

No, we don’t. We need technical Polish-speaking contributors to help.

aiteam · February 4, 2020, 8:42pm

If you want me to prepare the rules and blacklist, just let me know.

jakub.wrobel7 · February 5, 2020, 9:03pm

Let’s do it together, see here https://discourse.mozilla.org/t/coordination-of-input-for-polish-language-wiki-scrapper/53380

aiteam · February 6, 2020, 3:11pm

Hey, is there a hard limit (10,000) in the Sentence Collector?
I’m adding and also reviewing new sentences, but the counters on
my profile don’t change:

Profile: aiteam

10000 sentences added
10000 sentences reviewed
… across 1 language(s)

jakub.wrobel7 · February 6, 2020, 9:18pm

Hello @aiteam,
you have made a lot of contributions. Can you please add a better description of where the sentences come from? In my opinion, “own choosed and edited” does not indicate that the sentences are copyright-free.
Some sentences seem broken, which makes them a bit hard to read and process, like this one: “Wyskoczy do klubu tury…iej wielkości gwiazd.”
Also, I am not sure whether multiple sentences in one sentence string are OK for the dataset (like this one: “Wytrzeźwiej. Odpocznij. I przyjdź do mnie.”).

It is very nice that you have contributed so much, but keep in mind that we also aim for some quality in the dataset ;). So if you can somehow pre-process the sentences, it would reduce the review time and the percentage of rejected sentences.

If it is possible to refine the sentences you have uploaded with a script, maybe it would be good to remove them from the Sentence Collector for now and add them again after refinement?
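
A minimal sketch of the kind of pre-processing suggested above, assuming the sentences are available as plain strings before upload. The two heuristics and the name `flag_entry` are hypothetical and only illustrate the two problems mentioned in this post (truncated text and several sentences in one entry); they are not the Sentence Collector's own validation rules.

```python
import re

# Hypothetical heuristics for illustration only.
# An ellipsis often marks text that was cut or joined incorrectly.
ELLIPSIS = re.compile(r"…|\.{3}")
# A sentence terminator followed by whitespace and more text suggests
# that one entry contains several sentences.
INNER_BREAK = re.compile(r"[.!?]\s+\S")


def flag_entry(entry: str) -> list[str]:
    """Return the reasons an entry deserves a manual look (empty if none)."""
    reasons = []
    if ELLIPSIS.search(entry):
        reasons.append("ellipsis")
    if INNER_BREAK.search(entry.strip()):
        reasons.append("multiple sentences")
    return reasons


examples = [
    "Wyskoczy do klubu tury…iej wielkości gwiazd.",
    "Wytrzeźwiej. Odpocznij. I przyjdź do mnie.",
    "To jest jedno zdanie.",
]
for entry in examples:
    print(flag_entry(entry), entry)
```

Flagged entries could be held back for manual review instead of being uploaded, which keeps the script conservative: it never rewrites a sentence, it only sorts them.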

aiteam · February 7, 2020, 6:35am

Hey Kuba,

I’ve built a semi-automatic solution that extracts sentences (with NLTK) from old movie subtitles. The process is not fully automated, and I try to review the sentences before I add them to the collector (sometimes I can miss a low-quality sentence). I’ll try to modify the process so that it takes only sentences in which every word is in the Polish dictionary (BTW, my blacklist is still growing, so the quality of the batches should improve with each iteration).

Do I need to include the movie title in the description of a sentence batch?

BR
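
The dictionary-and-blacklist filter described in this post could look roughly like the sketch below. The word sets, the blacklisted word, and the name `keep_sentence` are made up for illustration; a real run would presumably load a full Polish word list and the actual blacklist from files.

```python
import re

# Tiny stand-ins for illustration only (assumptions, not real data).
POLISH_WORDS = {"to", "jest", "bardzo", "dobry", "film", "kurde"}
BLACKLIST = {"kurde"}

# Runs of letters, including Polish diacritics; digits and "_" are excluded.
WORD = re.compile(r"[^\W\d_]+")


def keep_sentence(sentence, dictionary, blacklist):
    """Keep a sentence only if every word is known and none is blacklisted."""
    words = [w.lower() for w in WORD.findall(sentence)]
    if not words:
        return False
    return all(w in dictionary and w not in blacklist for w in words)


candidates = [
    "To jest bardzo dobry film.",
    "Kurde, to jest dobry film.",
    "To jest okay film.",
]
kept = [s for s in candidates if keep_sentence(s, POLISH_WORDS, BLACKLIST)]
print(kept)
```

Only the first candidate survives here: the second contains a blacklisted word and the third contains a word missing from the dictionary. A plain word list will also reject inflected forms it does not contain, so the size of the dictionary matters a lot for Polish.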

aiteam · February 7, 2020, 6:47am

Kuba,

what about the counters on the profile: is it safe to add new sentences?

BTW, I’ve reviewed my code. I had the English pickle loaded in the tokenizer, so new batches should be much better now (the process now uses the Polish pickle).

BR
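
For readers hitting the same problem: NLTK's `sent_tokenize` takes a `language` argument that defaults to `"english"`, so the Polish model has to be requested explicitly. The sketch below assumes the Punkt data has already been downloaded (the data package is named `punkt` or `punkt_tab` depending on the NLTK version). The helper name `load_sentence_splitter` and the regex fallback are additions for illustration so the snippet still runs without NLTK; they are not part of the process described in this thread.

```python
import re


def load_sentence_splitter(language="polish"):
    """Return a function that splits text into a list of sentences.

    Uses NLTK's Punkt model for `language` when NLTK and its data are
    available; otherwise falls back to a naive split on ., ! and ?.
    """
    try:
        from nltk.tokenize import sent_tokenize

        # Raises LookupError if the Punkt data for this language is missing.
        sent_tokenize("Test.", language=language)
        return lambda text: sent_tokenize(text, language=language)
    except (ImportError, LookupError):
        boundary = re.compile(r"(?<=[.!?])\s+")
        return lambda text: [s for s in boundary.split(text.strip()) if s]


split_sentences = load_sentence_splitter("polish")
print(split_sentences("To jest zdanie. To jest drugie zdanie."))
```

The naive fallback knows nothing about Polish abbreviations, so it is only a placeholder; the language-specific Punkt model is what avoids splitting in the wrong places.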

stergro · February 7, 2020, 10:25am

No, this is just a display problem with the website, see here and here. You can add as many sentences as you want.

aiteam · February 7, 2020, 10:40am

Hey Stefan

thanks for the answer. I’ve got another issue: why doesn’t the total number of sentences change either? Is it the same issue as the counters on the profile?

25097 total sentences.
best regards

Adrijaned · February 7, 2020, 2:00pm

Hey, how old are the movies we are talking about? Please note that if a movie was released any later than 1950, or probably even 1940, the scripts and, in effect, the subtitles are probably still protected by copyright and thus unsuitable for Common Voice.

aiteam · February 7, 2020, 2:54pm

Hey, most of them are from the ’90s, but the group that made the
translations doesn’t exist anymore.

aiteam · February 7, 2020, 2:57pm

BTW, if you have ideas about where I can get source material for
sentences, just let me know.

Adrijaned · February 7, 2020, 3:03pm

You may consider manually scraping some material from https://pl.wikisource.org/wiki/Wikiźródła:Strona_główna; they have a fair collection of texts that are usually guaranteed to be CC0.

aiteam · February 7, 2020, 3:08pm

I tried adding a sentence from another profile, and the number of
added sentences didn’t increase.

aiteam · February 7, 2020, 3:09pm

Ruben said not to add sentences from Wikipedia to the collector …

Adrijaned · February 7, 2020, 3:16pm

Wikisource is a different project by the Wikimedia Foundation and serves basically only as a database of public-domain (PD) works. The restriction on Wikipedia applies only to the main project itself (and possibly some of the other projects), because the content there is originally created by Wikipedia contributors, who retain the rights to their content under the CC-BY-SA license or a similar one (and thus it is not CC0).

aiteam · February 7, 2020, 3:28pm

Thanks for the hint! I’m going to prepare a batch from
Wikisource then.