Polish sentences concerns

Adrijaned · January 17, 2020, 2:17pm

CC-BY-3.0 is definitely not a license these sentences should be under; the only permissible license is CC0 or an equivalent. If you could provide a list of all the sentences you found that were submitted under the wrong license (ideally one sentence per line), or at least the details of the specific imports (author and source), it would be immensely helpful.

@mkohler

nukeador · January 17, 2020, 10:11pm

If we have the authors identified, @mkohler can help clean up all the sentences.

Thanks so much for reporting!

As a reminder, we would love the Polish community to help with the topic “[Technical feedback needed] Wikipedia extractor script beta”.

Scarfmonster · January 17, 2020, 11:50pm

I identified all sentences with these sources as definitely having a wrong licence:

https:// polona.pl/item/niespokojni,NTM0OTAxMzc/178/#info:metadata
https:// polona.pl/item/niespokojni,NTM0OTAxMzc/119/#info:metadata
https:// polona.pl/item/niespokojni,NTM0OTAxMzc/68/#info:metadata
https:// polona.pl/item/niespokojni,NTM0OTAxMzc/31/#info:metadata
https:// polona.pl/item/niespokojni,NTM0OTAxMzc/20/#info:metadata
https:// polona.pl/item/niespokojni,NTM0OTAxMzc/11/#info:metadata
Niebezpiecznik.pl
https:// sjp.pwn.pl/slowniki/jak-si%C4%99-masz.html

These are all listed here: https://gist.github.com/Scarfmonster/241ec91ee1c0fe76ec3ad3cf7229ea35

In addition, some sentences list opensubtitles as the source. I wasn’t sure about the licensing here, so I included them as a separate list…

jakub.wrobel7 · January 20, 2020, 7:46pm

I was absent for some time and was quite surprised to see such a jump in the Polish sentence base. I still have quite a lot of sentences that I parsed from wolnelektury.pl some time ago and have not uploaded yet. Do you think it is still worth uploading them, @Scarfmonster? I can also put them on GitHub so someone else can do it in parallel. I will try to check the wiki script this weekend, @nukeador, when I get some time for it.

nukeador · January 21, 2020, 12:47pm

Quick note: We are seeing thousands of new clips and listens on Polish today.

Bearing in mind that there are currently only 7400 sentences, that allows for only approx. 10 validated hours without repetitions.
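
The estimate above can be reproduced with a back-of-the-envelope calculation. This is only a sketch: the average clip length of 5 seconds is an assumption chosen for illustration and is not stated in the thread.

```python
# Rough estimate of how many hours of audio a sentence pool allows
# if every sentence is recorded exactly once.
SENTENCE_COUNT = 7400      # figure quoted in the post
AVG_CLIP_SECONDS = 5.0     # assumption, not from the thread


def unique_hours(sentence_count: int, avg_clip_seconds: float) -> float:
    """Return hours of audio when each sentence is recorded once."""
    return sentence_count * avg_clip_seconds / 3600


estimate = unique_hours(SENTENCE_COUNT, AVG_CLIP_SECONDS)
print(round(estimate, 1))
```

With a longer or shorter average clip the figure scales linearly, so the number of unique sentences is the limiting factor either way.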

If we expect similar flows of people in the short term, we should try to get the Wikipedia extraction done sooner, to avoid people recording the same sentence more than once (which is not really useful for STT model training).

Thanks again for your efforts and contributions!

Georgrio_S · January 21, 2020, 2:56pm

It’s because someone posted a link on Wykop, a Polish Reddit-like site. The so-called Wykop effect.

stergro · January 21, 2020, 4:03pm

Here is the thread, might be good to collect all the feedback (over 180 comments):

I don’t know much Polish, but there are definitely a few posts about punctuation problems, and some people are posting sentence examples.

I have created a few user spikes by posting about Common Voice on Reddit, but never anything with such a big impact. Congrats to whoever posted this.

Scarfmonster · January 21, 2020, 10:09pm

@jakub.wrobel7
I believe it would be good to steadily upload more sentences to the sentence collector as the queue goes down. Even with the wiki extractor it’s always good to have more sentences from more different sources.

@nukeador
I am currently evaluating the wiki scraper for Polish. I managed to more or less work out a couple of regexes to filter the abbreviation patterns, but got stuck on replacements.
Specifically, the problem is that some abbreviations would be extremely useful to expand, because otherwise useful sentences that are unlikely to be written any other way get filtered out. One issue with expanding abbreviations, though, is that Polish has declension, while all forms of a word usually map to a single abbreviation. This is not a big problem, as the number of filtered sentences doesn’t seem to be too large, but I am worried that some types of sentences may be completely excluded this way.

A similar example: just about all sentences mentioning a specific century are filtered out, since we almost exclusively use Roman numerals for centuries. That is about 5% of all sentences in my sample of 350000.
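
A minimal sketch of both problems, in Python rather than the extractor's own code. The uppercase pattern is the one visible in the config quoted further down in this post; the sample sentences and the "ul." replacement are made up for illustration and are not part of that config.

```python
import re

# Pattern from the quoted polish.toml: two or more consecutive uppercase
# letters. It catches acronyms, but also Roman numerals such as "XIV".
ABBREVIATION_PATTERN = re.compile(r"[A-ZĄĆĘŁŃÓŚŻŹ]{2,}")


def is_filtered(sentence: str) -> bool:
    """True if the sentence would be dropped by the abbreviation pattern."""
    return ABBREVIATION_PATTERN.search(sentence) is not None


def expand(sentence: str, replacements) -> str:
    """Apply fixed (abbreviation, expansion) pairs in order."""
    for abbreviation, expansion in replacements:
        sentence = sentence.replace(abbreviation, expansion)
    return sentence


# "The castle was built in the 14th century." is dropped because of "XIV".
century = "Zamek zbudowano w XIV wieku."
plain = "Zamek zbudowano bardzo dawno."
kept = [s for s in (century, plain) if not is_filtered(s)]

# Hypothetical replacement: "ul." always becomes the nominative "ulica".
# After "na" the locative "ulicy" is required, so the result is ungrammatical.
expanded = expand("Mieszkam na ul. Długiej.", [("ul.", "ulica")])
print(kept)
print(expanded)
```

A fixed replacement table is therefore only safe for abbreviations whose expansion does not inflect, which is why the three replacements in the config are all set phrases.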

Also, @jakub.wrobel7 this is my current config for Polish in the extractor.

The abbreviation_patterns list needs expanding because it is missing a lot of abbreviations which do not end with a period, along with 3+ letter ones. It should have all other types of abbreviations handled though.

github.com

Scarfmonster/common-voice-wiki-scraper/blob/polish/src/rules/polish.toml

min_trimmed_length = 2
min_word_count = 1
max_word_count = 14
min_characters = 3
may_end_with_colon = false
quote_start_with_letter = true
needs_punctuation_end = false
needs_letter_start = true
needs_uppercase_start = true
broken_whitespace = ["  ", " ,", " .", " ?", " !", " ;", " :", "( ", " )"]
allowed_symbols_regex = "[a-ząćęłńóśżźA-ZĄĆĘŁŃÓŚŻŹ ,.?!:\\-–—\"'„“”\\(\\)]"

replacements = [
  ["p.n.e.", "przed naszą erą"],
  ["n.e.", "naszej ery"],
  ["n.p.m.", "nad poziomem morza"]
]

abbreviation_patterns = [
  "[A-ZĄĆĘŁŃÓŚŻŹ]{2,}",

This file has been truncated.
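
To make the effect of these settings concrete, here is a rough Python sketch of how such rules could be applied to a candidate sentence. It is not the extractor's implementation (the extractor is written in Rust), and the meaning given to each key is an assumption based on its name. Only values visible in the excerpt above are used; the truncated abbreviation_patterns list is left out.

```python
import re

# Values copied from the visible part of polish.toml.
RULES = {
    "min_word_count": 1,
    "max_word_count": 14,
    "min_characters": 3,
    "needs_uppercase_start": True,
    "broken_whitespace": ["  ", " ,", " .", " ?", " !", " ;", " :", "( ", " )"],
    "allowed_symbols_regex": "[a-ząćęłńóśżźA-ZĄĆĘŁŃÓŚŻŹ ,.?!:\\-–—\"'„“”\\(\\)]",
}
REPLACEMENTS = [
    ("p.n.e.", "przed naszą erą"),
    ("n.e.", "naszej ery"),
    ("n.p.m.", "nad poziomem morza"),
]
ALLOWED = re.compile(RULES["allowed_symbols_regex"])


def apply_replacements(sentence: str) -> str:
    # Order matters: "p.n.e." must be handled before its substring "n.e.".
    for abbreviation, expansion in REPLACEMENTS:
        sentence = sentence.replace(abbreviation, expansion)
    return sentence


def passes(sentence: str) -> bool:
    """Assumed semantics of the rules; the real extractor may differ."""
    words = sentence.split()
    if not RULES["min_word_count"] <= len(words) <= RULES["max_word_count"]:
        return False
    if len(sentence.strip()) < RULES["min_characters"]:
        return False
    if RULES["needs_uppercase_start"] and not sentence[:1].isupper():
        return False
    if any(bad in sentence for bad in RULES["broken_whitespace"]):
        return False
    # Every single character must belong to the allowed set.
    return all(ALLOWED.fullmatch(ch) for ch in sentence)


print(apply_replacements("Wieś leży dwieście metrów n.p.m."))
print(passes("Miasto ma 5 tysięcy mieszkańców."))  # digits are not allowed
```

Because digits are outside the allowed set, any sentence with a year or a number is dropped unless it is rewritten first.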

@stergro
Most complaints are about people who can’t read the sentences properly, substitute different words, ignore punctuation and generally read sentences like a string of letters. The screenshots of sentences are just ones which are amusing out of context. There are some which are inappropriate though. These come from the book Niespokojni and should be removed anyway due to wrong licensing.

There is also some misunderstanding of what the goal of the project is. A lot of people don’t know the resulting dataset is going to be freely available. To them it looks like a lot of people are suddenly doing free work to create another closed source product.

nukeador · January 22, 2020, 12:28pm

Great progress!

I suggest you ask your questions about the script directly on the extractor topic, a lot of people who have been doing extractions are tracking that one and will be able to assist.

Thanks!

Etua · January 22, 2020, 10:59pm

I have noticed that some sentences start with “-”, which most if not all people will ignore while reading. For the sake of data quality, I suppose it would be advisable to drop it, even without removing existing recordings. Most of the affected strings are actually still in review, so we can edit them before they are added to the main pool.

jakub.wrobel7 · January 23, 2020, 7:34pm

You are probably referring to sentences that are actually part of dialogue like
“- Hey! - someone shouted.”
I can remove the leading dash in my future uploads of pieces of CC0 books, but the result looks a bit off (it might be just me):
“Hey! - someone shouted.”
Also, I do not know if it actually brings any quality improvement. Any more thoughts?

stergro · January 23, 2020, 7:41pm

Right now DeepSpeech removes all characters that are not letters before training, so it doesn’t matter at all. This might change in the future.
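
A small sketch of why the leading dash makes no difference under that kind of cleanup. This is not DeepSpeech's actual preprocessing code; the lowercasing and whitespace handling are assumptions made for the illustration.

```python
def normalize(sentence: str) -> str:
    """Keep letters only, separated by single spaces, in lowercase."""
    letters_only = "".join(ch if ch.isalpha() else " " for ch in sentence)
    return " ".join(letters_only.lower().split())


with_dash = normalize("- Hey! - someone shouted.")
without_dash = normalize("Hey! - someone shouted.")
print(with_dash)
print(with_dash == without_dash)
```

Both variants reduce to the same transcript, so editing the sentence text would only matter if a future pipeline kept punctuation.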

jakub.wrobel7 · January 29, 2020, 10:24pm

I forked the git repo provided by @Scarfmonster and tried the scraping. It did pretty well for a first try. I removed ( and ) from the allowed symbols, as many sentences seemed broken by them (i.e. some details in the middle of a sentence surrounded by parentheses). I will run the blacklist operations overnight and will put the word statistics in the forked repo if the file is of a reasonable size. @Scarfmonster, should we make a separate topic for work on the Polish extractor settings? I also noticed this topic, for future reference: Using the Europarl Dataset with sentences from speeches from the European Parliament - Common Voice - Mozilla Discourse.
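
For readers unfamiliar with the word statistics step, a minimal sketch of the idea: count how often each word occurs across the extracted sentences and treat rare words as blacklist candidates. The thread does not say how the blacklist is actually built, so the threshold and the sample sentences here are purely hypothetical.

```python
import re
from collections import Counter

# Runs of letters, including Polish diacritics (no digits or underscores).
WORD = re.compile(r"[^\W\d_]+")


def word_counts(sentences) -> Counter:
    """Count lowercase word occurrences across all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(word.lower() for word in WORD.findall(sentence))
    return counts


def rare_words(counts: Counter, min_occurrences: int) -> list:
    """Words seen fewer than min_occurrences times, sorted alphabetically."""
    return sorted(w for w, n in counts.items() if n < min_occurrences)


sample = ["Kot śpi.", "Kot je.", "Pies śpi."]   # made-up sentences
counts = word_counts(sample)
print(rare_words(counts, min_occurrences=2))
```

On a real Wikipedia dump the statistics file can get large, which is presumably why its size is a concern before committing it to a repository.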

jakub.wrobel7 · February 5, 2020, 9:04pm

Created this https://discourse.mozilla.org/t/coordination-of-input-for-polish-language-wiki-scrapper/53380 to achieve some manner of organization

jakub.wrobel7 · February 6, 2020, 9:02pm

I think I have found more sentences that do not originate from CC0 sources:

“source”: “From the book.”, “username”: “narid” – seems to be actually book “Biały Kieł” not CC0 AFAIK
“source”: “https://www.gutenberg.org/files/6000/6000-h/6000-h.htm Project Gutenberg version of “Ironia Pozorow” by Maciej hr. Lubienski”, “username”: “hellbunnie” - license here is somewhat about freedom but probably not CC0 equivalent
“source”: “https://pl.wikipedia.org/wiki/Zachęta_Narodowa_Galeria_Sztuki”, “username”: “michalstepien” - wikipedia originating sentences

Maybe it would be nice to add source information to the review process for sentences?
Can any other Polish contributor perform a second check before deletion?
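
One way a reviewer could work with the source information is to group sentences by source and username, so that a whole import can be checked at once. The sketch below assumes records shaped like the fields quoted above, plus a hypothetical "sentence" field; the sentence texts are placeholders, not real data.

```python
from collections import defaultdict


def group_by_source(records) -> dict:
    """Map (source, username) to the list of sentences from that import."""
    grouped = defaultdict(list)
    for record in records:
        key = (record["source"], record["username"])
        grouped[key].append(record["sentence"])
    return dict(grouped)


records = [
    # "sentence" values are placeholders for illustration only.
    {"source": "From the book.", "username": "narid", "sentence": "Zdanie pierwsze."},
    {"source": "From the book.", "username": "narid", "sentence": "Zdanie drugie."},
    {"source": "Niebezpiecznik.pl", "username": "someone", "sentence": "Zdanie trzecie."},
]
grouped = group_by_source(records)
for (source, username), sentences in grouped.items():
    print(source, username, len(sentences))
```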

Adrijaned · February 7, 2020, 1:56pm

The original author of White Fang, Jack London, has been dead long enough for his works to be in the public domain, but the translator of the specific edition is probably more important for Common Voice purposes, and I was not able to find that. Do you know the translator of the specific edition the sentences were taken from, so you could check? (You generally want the translator to have been dead for more than 70 years for the work to fall into the public domain.)
The author of “Ironia Pozorow” is still alive as far as I was able to find, and I couldn’t find any information about him releasing the book into the public domain. How it ended up on Gutenberg in the first place is a slight mystery to me.
If it comes from Wikipedia, it needs to be gone.
Could you please re-check Biały Kieł, and then post all of these in the “Sentence collector copyright issues” topic for bookkeeping?

jakub.wrobel7 · February 7, 2020, 9:27pm

Done https://discourse.mozilla.org/t/sentence-collector-copyright-issues/52767/3?u=jakub.wrobel7

stergro · May 2, 2020, 11:31am

Hey @Scarfmonster, Polish has now recorded 100 hours, with many repetitions. Are you still working on the Wikipedia export? I think in a situation like this it would be better to import quickly and remove bad sentences afterwards.

Scarfmonster · May 2, 2020, 1:41pm

I haven’t had much time to work on the export. The biggest issue here is with rust-punkt. My knowledge of Rust is very limited, but from what I understand rust-punkt was written to imitate how NLTK Punkt works, and Rust and Python handle UTF-8 string slices very differently. In rust-punkt, training doesn’t work on non-ASCII texts at all, because it tries to slice “not at char boundary”.
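
The difference can be illustrated from the Python side. Python strings are indexed by code point, whereas Rust string slices are indexed by byte offset and panic when the offset falls inside a multi-byte character. The sketch below mimics that check on encoded bytes; it is an analogy only and is not taken from rust-punkt or NLTK.

```python
def is_char_boundary(data: bytes, index: int) -> bool:
    """True if index does not fall inside a multi-byte UTF-8 sequence."""
    if index == 0 or index == len(data):
        return True
    if index < 0 or index > len(data):
        return False
    # Continuation bytes have the bit pattern 10xxxxxx.
    return (data[index] & 0xC0) != 0x80


def can_decode(data: bytes) -> bool:
    """True if the bytes form valid UTF-8 on their own."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "Łódź."
encoded = text.encode("utf-8")

print(len(text), len(encoded))        # code points vs. bytes
print(text[:1])                       # slicing by code point is always safe
print(is_char_boundary(encoded, 1))   # offset 1 is inside "Ł"
print(can_decode(encoded[:1]))        # half of "Ł" is not valid UTF-8
```

Offsets computed as character counts (as Python code would do) are therefore not valid byte offsets once the text contains letters such as ł, ó or ź.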

Removing bad sentences after export may be hard because abbreviations are very common in Wikipedia sentences, and rust-punkt currently splits sentences at every occurrence of a period in text. It’s doable, but will require a lot of work.
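
A short sketch of what splitting at every period does to a sentence containing an abbreviation, and of a simple workaround that protects a known list of abbreviations before splitting. This is not how Punkt works (Punkt is meant to learn abbreviations from training data); the function names and the example sentence are hypothetical.

```python
def naive_split(text: str) -> list:
    """Split at every period, as described in the post above."""
    return [part.strip() + "." for part in text.split(".") if part.strip()]


def split_with_abbreviations(text: str, abbreviations) -> list:
    """Hide known abbreviations behind placeholders, split, then restore."""
    protected = text
    for i, abbreviation in enumerate(abbreviations):
        protected = protected.replace(abbreviation, f"\x00{i}\x00")
    sentences = naive_split(protected)
    for i, abbreviation in enumerate(abbreviations):
        sentences = [s.replace(f"\x00{i}\x00", abbreviation) for s in sentences]
    return sentences


text = "Szczyt ma dwa tysiące metrów n.p.m. i jest popularny."
broken = naive_split(text)
whole = split_with_abbreviations(text, ["n.p.m."])
print(broken)
print(whole)
```

The workaround fails when an abbreviation also ends a sentence, so it only reduces the cleanup work rather than replacing a trained tokenizer.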

stergro · May 4, 2020, 6:27am

I am not so sure about this; other languages with non-ASCII characters are working very well. What kind of problem are you experiencing exactly?