Czech Wikipedia extraction concerns

I am trying to extract sentences using Wikiextractor from Czech Wikipedia and so far I have these main concerns:

  1. There are many foreign names, even when generating disallowed words with a high max frequency (I am currently using 100). I would generally say this is not a problem, if not for the fact that people pronounce some of them differently. There is the native English pronunciation, which I see as correct. Then there are various adaptations to Czech that consist mainly of making the words sound closer to how they are written, like pronouncing silent graphemes, reading “th” as [th] instead of [ð], etc. I fear such data in the dataset are not good for ML. Does anyone have a solution to this?

  2. Sometimes very short lines are extracted, mostly because, as already mentioned here, the tokenization, at least for non-English languages, is very poor. Is there a setting that requires sentences of a certain length, for example five characters?

  3. Is even_symbols documented anywhere? It is supposed to be a char array, but what should it contain? Pairs of matching symbols like ["(", ")", "“", "”", …]?

Hi, to answer your questions:

  1. The dataset is mainly intended for ML STT algorithms. When people eventually speak to the trained algorithms, they will have the same terrible pronunciation they have when they submit the sentences to Common Voice. And you want to recognize those words if possible. So IMHO, having foreign words is not an issue. (On the other hand, some of those words are CRAZY. Like, what the heck is “rams” supposed to mean, or “raj”? Or “aérospatiale”? All at 104.)
  2. Would min_characters or min_word_count do what you are trying to accomplish?
  3. From a quick skim through the code, it seems that it needs to contain an array of symbols, and for each of those symbols it only checks whether it occurs in the sentence an even number of times. So, as far as I was able to find, there is no way to require that each "(" has exactly one matching ")". @mkohler might know a bit more about this, though.
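
To make point 3 concrete, here is a minimal Python sketch of an even-count check as described above. It is not the extractor's actual code (which is written in Rust); the function name `has_even_symbols` and the sample sentences are made up for illustration.

```python
def has_even_symbols(sentence: str, even_symbols: list[str]) -> bool:
    """Return True if every listed symbol occurs an even number of times.

    Hypothetical re-implementation of the behaviour described in the thread,
    not the tool's real code.
    """
    return all(sentence.count(symbol) % 2 == 0 for symbol in even_symbols)


symbols = ['"', "(", ")"]

# Two double quotes: accepted.
print(has_even_symbols('Řekl "ahoj" a odešel.', symbols))  # True
# One double quote: rejected.
print(has_even_symbols('Řekl "ahoj a odešel.', symbols))  # False
# Balanced parentheses are rejected, because "(" and ")" each occur once.
print(has_even_symbols("Praha (hlavní město) leží na Vltavě.", symbols))  # False
# Two unclosed "(" are accepted, because the count is even.
print(has_even_symbols("Praha ((hlavní město.", symbols))  # True
```

The last two cases show why a per-symbol even count only really suits symbols that are identical when opening and closing, such as the straight double quote.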

And just one detail for next time: aktuálně.cz is not a suitable source of sentences for Common Voice, because as far as I know they retain all rights to their articles. The sentences have been hard-deleted from the dataset. However, if you have any additional material, now or in the future (e.g. explicit consent to include the sentences in Common Voice), I would not object to re-adding these and other sentences. Just please at least mention this consent in the “source” field when uploading the sentences.

Oh, and an afterthought: since words with a frequency below 100 are a huge majority of the file, it would be almost more efficient to whitelist words instead of blacklisting them. There are many completely valid and fine words under the boundary of 100, there only because they don’t find much use in an encyclopedic style of text, as well as weird and completely out-of-place words above the boundary. So it could make sense to generate the whitelist from an external source, for instance Wikidata. Or just use a preexisting dictionary file (https://crates.io/keywords/hunspell).
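
As a rough illustration of the whitelist idea, here is a small Python sketch that accepts a sentence only if every word is in an allowlist. It assumes a plain word list with one word per line; a real hunspell .dic file also carries affix flags and lists stems rather than every inflected form, so it would need extra handling. The names `load_allowed_words` and `sentence_is_allowed` are hypothetical and not part of the extractor.

```python
import re

# Runs of Unicode letters (no digits, no underscore).
WORD_RE = re.compile(r"[^\W\d_]+")


def load_allowed_words(lines):
    """Build a lowercase set from an iterable of words, one per line."""
    return {line.strip().lower() for line in lines if line.strip()}


def sentence_is_allowed(sentence, allowed_words):
    """Accept a sentence only if it has words and all of them are allowed."""
    words = WORD_RE.findall(sentence.lower())
    return bool(words) and all(word in allowed_words for word in words)


# Tiny in-memory word list standing in for a dictionary file.
allowed = load_allowed_words(["praha", "je", "hlavní", "město"])

print(sentence_is_allowed("Praha je hlavní město.", allowed))  # True
print(sentence_is_allowed("Aérospatiale je firma.", allowed))  # False
```

With a real file, the same function could be fed an open file handle, since iterating over it yields lines.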

Thanks for the reaction. Ad 1, I partially agree; however, my concern is not about incorrect pronunciation but about differing pronunciations. My reasoning is that ASR models will not generalize well on such data, particularly if there is not a lot of it, as is the case with rarer foreign names, and these are the ones I suppose people will mispronounce most. But if it is not a concern for Mozilla, it is not a concern for me.

Ad 2, no. min_word_count comes close, but I would like something like min_characters for the whole sentence. This can be done with regexes, but I was wondering if there is a setting for this.

Ad 3, that is a shame. So the setting is useful just for double quotes? Pairing is also doable to some degree with regexes, but not as easily.

As for aktualne, I don’t think that by law they can retain rights to news reports, or that copyright could apply to a few sentences from an article, but I am not a lawyer and I understand the precaution.

Good idea. Is it possible currently, or would I have to change the code?

So to point 2, there is a setting for this after all; I had overlooked min_trimmed_length, sorry.
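
For anyone doing this outside the tool, a trimmed-length filter amounts to something like the Python sketch below. This only illustrates the idea and is not how min_trimmed_length is implemented in the extractor; the function name and the threshold of five characters (taken from the original question) are assumptions.

```python
def has_min_trimmed_length(sentence: str, min_length: int = 5) -> bool:
    """Check the sentence length after stripping surrounding whitespace."""
    return len(sentence.strip()) >= min_length


candidates = ["Ano.", "  Ne.  ", "Praha je hlavní město."]
kept = [s for s in candidates if has_min_trimmed_length(s)]
print(kept)  # ['Praha je hlavní město.']
```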

Agree that this might be handy. I’ve filed https://github.com/Common-Voice/common-voice-wiki-scraper/issues/81 for this.

That depends on the language. For Czech that might very well just fit the quotes symbols.

Except Czech uses bottom quotation marks to open a quote and top ones to close it.
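
Because the Czech opening and closing marks are different characters, what would actually be needed is a pair-aware check. Here is a hedged Python sketch of one. It is not an existing extractor feature as far as this thread establishes; the function name `pairs_balanced` and the pair list are hypothetical, and the sketch assumes the opener and closer of each pair are distinct characters.

```python
def pairs_balanced(sentence: str, pairs: list[tuple[str, str]]) -> bool:
    """Check that every opener is closed by its matching closer, in order."""
    openers = {opener for opener, _ in pairs}
    closer_to_opener = {closer: opener for opener, closer in pairs}
    stack = []
    for ch in sentence:
        if ch in openers:
            stack.append(ch)
        elif ch in closer_to_opener:
            # A closer must match the most recently opened symbol.
            if not stack or stack.pop() != closer_to_opener[ch]:
                return False
    # Anything left on the stack was opened but never closed.
    return not stack


# Czech quotes: „ (U+201E) opens, “ (U+201C) closes.
czech_pairs = [("„", "“"), ("(", ")")]

print(pairs_balanced("Řekl „ahoj“ a odešel.", czech_pairs))  # True
print(pairs_balanced("Řekl „ahoj a odešel.", czech_pairs))  # False
print(pairs_balanced("Praha (hlavní město) leží na Vltavě.", czech_pairs))  # True
```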

Whitelisting is not possible as of now, as far as I know, but it shouldn’t be too big a change to the code. Definitely doable even at my level of experience, and I have never really written anything usable in Rust yet.

It turns out that removing most non-alphabetic characters altogether and leaving just the most basic ones like .,- greatly reduces errors while leaving about 300k sentences. Any reason not to do this?
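
For reference, the kind of filter meant here can be sketched in Python as follows, assuming the intent is to drop whole sentences that contain any character other than Czech letters, spaces and the punctuation . , - (this is an external sketch, not the extractor's config; the names `ALLOWED_RE` and `only_basic_characters` are made up).

```python
import re

# Czech letters plus space and the basic punctuation . , -
ALLOWED_RE = re.compile(
    r"[A-Za-zÁČĎÉĚÍŇÓŘŠŤÚŮÝŽáčďéěíňóřšťúůýž .,\-]+"
)


def only_basic_characters(sentence: str) -> bool:
    """True if the whole sentence consists of allowed characters only."""
    return ALLOWED_RE.fullmatch(sentence) is not None


print(only_basic_characters("Praha je hlavní město."))  # True
print(only_basic_characters("Cena byla 5 % (odhad)."))  # False
```

Note that this also rejects any sentence with a foreign letter such as ö or ñ, which may or may not be wanted.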