I am trying to extract sentences from the Czech Wikipedia using Wikiextractor, and so far I have these main concerns:
- There are many foreign names, even when generating disallowed words with a high max frequency (I am currently using 100). I would generally say this is not a problem, if not for the fact that people pronounce some of them differently. There is the native English pronunciation, which I see as correct. Then there are various adaptations to Czech, which consist mainly of making the words sound closer to how they are written: pronouncing silent graphemes, reading “th” as [th] instead of [ð], etc. I fear such data are not good for ML. Does anyone have a solution to this? (A heuristic I am considering is in the first sketch after this list.)
- Sometimes very short lines are extracted, mostly because, as already mentioned here, the tokenization, at least for non-English languages, is very poor. Is there a setting that requires sentences of a certain minimum length, for example five characters? (My current post-processing workaround is in the second sketch below.)
- Is even_symbols documented anywhere? It is supposed to be a char array, but what should it contain? Pairs of matching symbols like ["(", ")", "“", "”", …]? (The third sketch below shows my current reading.)
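For the foreign-name problem, the best idea I have so far is a post-processing filter that drops sentences containing words whose spelling looks foreign by Czech conventions. This is just a sketch of my own heuristic, not a feature of the extractor, and the grapheme hints in the regex are rough guesses that would need tuning against real output:

```python
import re

# Grapheme clusters that are rare in native Czech spelling (my guesses).
FOREIGN_HINTS = re.compile(r"th|w|q|ough|ee|oo", re.IGNORECASE)

def looks_foreign(word: str) -> bool:
    """Flag a word whose spelling pattern suggests a foreign origin."""
    return bool(FOREIGN_HINTS.search(word))

def keep_sentence(sentence: str) -> bool:
    """Keep a sentence only if none of its words look foreign."""
    return not any(looks_foreign(w) for w in sentence.split())

print(keep_sentence("Dnes je hezky."))          # True
print(keep_sentence("Navštívil Southampton."))  # False ("th")
```

This obviously throws away some valid sentences along with the problematic ones, which is why I am asking whether there is a better solution.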
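For the short lines, since I have not found a built-in minimum-length setting, I currently filter the extracted output afterwards. A minimal sketch; "extracted.txt" is a hypothetical output file name:

```python
MIN_CHARS = 5  # my proposed threshold, adjust as needed

def filter_short_lines(lines, min_chars=MIN_CHARS):
    """Keep only lines whose trimmed length reaches min_chars."""
    return [line.strip() for line in lines if len(line.strip()) >= min_chars]

with open("extracted.txt", encoding="utf-8") as f:
    kept = filter_short_lines(f)
```

A built-in setting would still be preferable, since filtering after extraction skews the sentence counts per article.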
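As for even_symbols, my reading, stated purely as an assumption since I found no documentation, is that it lists symbols that must occur an even number of times in a sentence. That would make sense for quote characters where the opening and closing marks are the same glyph:

```python
from collections import Counter

# Assumed semantics: each listed symbol must appear an even number of times.
EVEN_SYMBOLS = ['"', "'"]

def has_even_symbols(sentence: str, symbols=EVEN_SYMBOLS) -> bool:
    """Check that each listed symbol appears an even number of times."""
    counts = Counter(sentence)
    return all(counts[s] % 2 == 0 for s in symbols)

print(has_even_symbols('Řekl "ano".'))  # True: two straight quotes
print(has_even_symbols('Řekl "ano.'))   # False: one unmatched quote
```

If that is the intent, then ordered pairs like "(" and ")" would need a different, matching-based check, which is exactly why I am unsure what the array should contain.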