Polish dataset download

aiteam · February 7, 2020, 3:08pm

I tried to add a sentence from another profile, and the number of added sentences didn’t increase.

aiteam · February 7, 2020, 3:09pm

Ruben said not to add sentences from Wikipedia in the collector …

Adrijaned · February 7, 2020, 3:16pm

Wikisource is a different project by the Wikimedia Foundation, and serves basically only as a database of PD works. The restriction applies only to the main Wikipedia project itself (and possibly some of the other projects), because the content there is originally created by the Wikipedia contributors, who retain the rights to their content under the CC-BY-SA license or a similar one (and thus, it is not CC0).

aiteam · February 7, 2020, 3:28pm

Thx for the hint!!! I’m going to prepare a batch from Wikisource then.

aiteam · February 7, 2020, 3:48pm

Ok, can you clean up my sentences from movie subtitles then? (aiteam) I’m preparing a new batch from Wikisource.

Adrijaned · February 7, 2020, 5:33pm

Not me. @mkohler, could you please do it, unless you find a better way?

mkohler · February 7, 2020, 5:38pm

Will do this weekend.

jakub.wrobel7 · February 7, 2020, 9:19pm

@aiteam, please remember to add the source of the texts when uploading them.
Also, You can read this topic (among many others), which has some useful info for processing sentences: https://discourse.mozilla.org/t/help-needed-in-processing-large-polish-text-base/40778. Since it seems You are knowledgeable in NLP and stuff (at least more than me, an amateur), maybe You would like to share your scripts with the community? I am sure we will find more sentence sources waiting to get processed in many languages.

aiteam · February 7, 2020, 9:38pm

Hey Kuba,

What type of information should be in the source (link, type, e.g. Wikisource, title of the novel/book)?

NLP is in my case just beginning ;) When I'm ready I'll publish my sources.

BR

jakub.wrobel7 · February 7, 2020, 9:50pm

Good to hear about NLP. Well, I think the material/book title and author would be good, and maybe a short URL of the website? When uploading texts from wolnelektury.pl I tend to put something like: Author;Title;wolnelektury.pl. And by the way, You can skip this site, as I already downloaded most of what was useful there.

Digression: is it only me, or do Your posts have odd formatting? To me they show as normal text

interrupted by such text
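The Author;Title;site convention mentioned above could be sketched roughly as follows. This is only an illustration: the function name `format_source` and the cleanup rules (collapsing whitespace, replacing stray semicolons) are assumptions, not something the sentence collector requires, and the author/title used in the call are just sample values.

```python
def format_source(author: str, title: str, site: str) -> str:
    """Build a source string in the 'Author;Title;site' style.

    Hypothetical helper: the name and cleanup rules are assumptions
    made for this sketch, not part of any Common Voice tool.
    """
    cleaned = []
    for part in (author, title, site):
        # A semicolon inside a field would clash with the separator,
        # so swap it for a comma, then collapse runs of whitespace.
        part = part.replace(";", ",")
        cleaned.append(" ".join(part.split()))
    return ";".join(cleaned)


# Sample values chosen only for illustration.
print(format_source("Adam Mickiewicz", "Pan Tadeusz", "wolnelektury.pl"))
```

The same string can then be pasted into the source field for every batch taken from that book.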

aiteam · February 7, 2020, 9:57pm

Ok, I’ll put the source URL of the book/novel. Don’t worry about Wolne Lektury, I’m scraping Wikisource atm.

About the formatting: maybe it’s a Mozilla Thunderbird issue.

mkohler · February 8, 2020, 4:23pm

I can only find sentences with the source “own choosed and edited”, can you be more specific here?

aiteam · February 9, 2020, 9:20am

That was exactly what I meant.

mkohler · February 9, 2020, 3:52pm

This is done. I haven’t run an export yet, but the deletion in the voice-web repo will be part of the next export.

grotos · February 22, 2020, 5:15pm

Can we use a GNU GPL v3 dataset? I am thinking about NKJP.

dabinat · February 22, 2020, 6:18pm

Unfortunately sources have to be CC-0 only.

Adrijaned · February 22, 2020, 6:20pm

Sources have to be CC0-compatible for use in Common Voice. GNU GPL specifically would forbid a few possible use cases for the resulting dataset due to its share-alike policy.

Tomasz_Zietkiewicz · April 8, 2020, 7:43pm

Hi,
I can see transcriptions on GitHub (https://github.com/mozilla/voice-web/tree/master/server/data/pl).
Can I also get corresponding audio files somewhere right now, before the official corpus is released?
Best regards

nukeador · April 13, 2020, 11:05am

Currently the only process we have is the full dataset publication, but we are working on improving this during this year so we can have more frequent or continuous access to the updated dataset.

nukeador · April 13, 2020, 11:06am

Also please note these are not “transcriptions”, these are sentences to record. On Common Voice people read sentences, we don’t transcribe voices.