I noticed that Polish has launched on voice.mozilla.org. When I tried recording and validating clips, I found quite a few sentences that are plainly wrong.
A couple of examples:
Meduzy nie bieskie i różowe kołysały się jak kwiaty o grubych i miękkich koronach.
Słyszał jej mruczenie bulgocące cicho w gardle, jak.
Chce czegoś bardzo, tylko narazie nie wie czego.
Czekam, aż obrazy podniosą się przede mną, jak przed zaklinaczem wężę.
Wszyscy trzej towarzysze za jej przykładem zerwali po tataraku i gryząc wilgotne
The mistakes, in order:
- “nie bieskie” should be “niebieskie” (the colour); “nie bieskie” reads as “not bieskie”, and “bieskie” is not a word,
- the sentence is cut off in the middle (yet, oddly enough, it ends with a full stop),
- “narazie” should be “na razie”,
- “wężę” should be “węże”; that is a wrong word form, i.e. a grammatical error,
- this one is also cut off in the middle.
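Cut-off sentences like the second and fifth examples could probably be caught automatically before review. Here is a rough sketch of such a check in Python. It is only a heuristic, not how the sentence collector works: the `looks_truncated` function and the `DANGLING_WORDS` list are my own assumptions, and the word list is a small hand-picked sample rather than anything complete.

```python
import re

# Assumption: a small, hand-picked sample of Polish words that rarely
# end a complete sentence (conjunctions and prepositions).
DANGLING_WORDS = {"jak", "i", "że", "oraz", "lub", "ale", "w", "z", "na", "do"}
TERMINAL_PUNCTUATION = (".", "!", "?", "…")


def looks_truncated(sentence: str) -> bool:
    """Heuristically flag sentences that seem to be cut off."""
    text = sentence.strip()
    if not text:
        return True
    # No closing punctuation at all, as in the fifth example.
    if not text.endswith(TERMINAL_PUNCTUATION):
        return True
    # Closing punctuation right after a dangling word, as in the second example.
    words = re.findall(r"\w+", text.lower())
    return bool(words) and words[-1] in DANGLING_WORDS
```

A check like this would not catch spelling errors such as “narazie”; those would still need a human reviewer or a spell checker.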
All of these were approved by the same people through the sentence collector. My concern is that this may have been some sort of blanket approval without the sentences actually being read: I remember that about two weeks ago sentences were suddenly being approved at a very high rate.
Another concern I have is licensing. The sentence collector says that all sentences have to be public domain. Taking the first sentence from the JSON export:
Sentence: Pewnego razu powiedział do niej: — Istnieje jeszcze rzecz, którą zataiłem przed tobą.
The page listed as its source says that the work is published under CC BY 3.0 PL. I am not sure, but I believe Creative Commons licences such as CC BY do not fall under the term “public domain”. There are 479 sentences that list the same page as their source. Checking the other sentences that list polona.pl as their source, I found 1584 sentences licensed under CC BY 3.0 PL.
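For anyone who wants to repeat this kind of count, here is a minimal sketch of how sentences in the export could be tallied per source. It assumes each record is a dict with a `source` field holding a URL or free text. That field name, the `count_by_source` helper, and the `export.json` file name in the comment are my assumptions and may not match the real export.

```python
from collections import Counter
from urllib.parse import urlparse


def count_by_source(records):
    """Count sentences per exact source string and per host.

    Assumption: every record is a dict with a "source" key.
    """
    by_url = Counter()
    by_host = Counter()
    for record in records:
        source = (record.get("source") or "").strip()
        by_url[source] += 1
        # Fall back to the raw string when the source is not a full URL.
        host = urlparse(source).netloc or source
        by_host[host] += 1
    return by_url, by_host


# Hypothetical usage, assuming the export is a JSON list of records:
#   import json
#   with open("export.json", encoding="utf-8") as f:
#       by_url, by_host = count_by_source(json.load(f))
#   print(by_host.most_common(10))
```

The counts only say where the sentences come from; the licence of each source page still has to be checked by hand.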