Polish dataset download

OK, I’ll put the source URL of the book/novel. Don’t worry about Wolne
Lektury; I’m scraping Wikisource at the moment.

About the formatting: maybe it’s a Mozilla Thunderbird issue :frowning:

I can only find sentences with the source “own choosed and edited”, can you be more specific here?

That was exactly what I meant.

This is done. I haven’t run an export yet, but the deletion in the voice-web repo will be part of the next export.

Can we use a dataset licensed under GNU GPL v3? I am thinking about NKJP.

Unfortunately, sources have to be CC0 only.

Sources have to be CC0-compatible for use in Common Voice. GNU GPL specifically would forbid a few possible use cases for the resulting dataset due to its share-alike policy.

I can see transcriptions there on GitHub (https://github.com/mozilla/voice-web/tree/master/server/data/pl).
Can I also get corresponding audio files somewhere right now, before the official corpus is released?
Best regards

Currently the only process we have is the full dataset publication, but we are working on improving this over the course of this year so we can have more frequent or continuous access to the updated dataset.

Also please note these are not “transcriptions”, these are sentences to record. On Common Voice people read sentences, we don’t transcribe voices.