[Technical feedback needed] Wikipedia extractor script beta

OK, I’ll try to explain myself better :slight_smile:

Using the scraper with the blacklist and some regular expressions for abbreviations and a few other words, we soon saw that this way we weren’t able to achieve an acceptable result for Basque. After filtering strange characters, etc., about 50% of the sentences were problematic, almost regardless of which repetition criteria we chose. That’s because the Basque Wikipedia isn’t as big as other ones, so many articles are quite short and full of people and place names, and because the Basque language itself is made up of many long words with low repetition rates: verb declension, re-declension, compound words, prefixes, a pile of suffixes… The problem was that many foreign words in the sentences didn’t follow Basque language rules and would ruin the phonetic models, so we needed to remove them (we kept the sentences whose foreign words are compatible with Basque).

We needed to divide the work between the volunteers, and most of them didn’t know regular expressions, so I took a text editor and made a first cleaning pass using about 200-300 regular-expression searches for words, or parts of words, that appear in problematic foreign words. Remember that Basque uses a lot of prefixes and suffixes, so creating perfect regular expressions for all the cases would be a nightmare. This way, I just decided on the go whether each sentence should be removed or not, and many potential regular expressions became unnecessary as the wrong sentences grew fewer and fewer. Thanks to this iterative process, I removed many sentences in a short time (but I didn’t end up with a list of perfect regular expressions that could be reused).

Then other volunteers continued the work: fixing sentences with spelling problems, removing or changing sentences with problematic foreign words, etc. Finally, from a collection of 110,000 de-duplicated sentences given by the scraper, we got 55,000 validated sentences. Have I explained myself now? Do you see how, after a horrible beginning (about 50% wrong), we got a quite good ending (about 1-2% wrong)?

I don’t know if the steps we took were the best or the optimal ones. I just know that the Basque Wikipedia is quite small and we got many new sentences for the Common Voice project.

Thanks, now I got it.

I suspect this manual filtering after the extraction might surface some issues with our current process.

We need to be able to run the extraction from our side (using the blacklist and rules), and then you can ask for removals or fixes to the final list of sentences, but not for any new sentences to be added.

This is the only way we have to ensure that our legal constraints are enforced by the scripts.

Can you come up with a proposal with these constraints in mind so we can move forward?

Thanks!

I have remembered that after generating the blacklist and before creating the scraper sentence list, we used some regular expressions just to reduce the automatic blacklist, because it contained more valid Basque words than invalid ones. The reason was, as I explained, the additive, agglutinative nature of the language, which produces a lot of rarely repeated words.

So I think it would be useful for other languages with properties similar to Basque that are interested in using the Wikipedia scraper to have a “whitelist” feature, or the possibility to define a list of regular expressions that prevents certain kinds of words from being included in the blacklist. For example, many suffixes cause a lot of valid Basque words to end up in the blacklist, and the same regular expressions I used for the manual cleaning could go into a configuration file: *gatik, *ganako, *rentzako, *rentzat, *rekin, *renganako… I used a lot of them, and some I checked manually because they could give false positives: *ren, *ri, *ra… Obviously, those last ones couldn’t be included in this hypothetical parameter.
If you think this could be interesting for languages other than Basque, I can create an issue in the GitHub project so other people can benefit from it.
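
As a rough illustration of this whitelist idea (this isn’t an existing scraper feature; the file names below are made up and the suffix list is just the examples above), the filtering could look something like this in Python:

import re

# Hypothetical input/output files: the raw auto-generated blacklist and the
# reduced blacklist that would actually be passed to the scraper.
BLACKLIST_IN = "basque_auto_blacklist.txt"
BLACKLIST_OUT = "basque_reduced_blacklist.txt"

# Suffixes from the examples above; words ending in them are very likely
# valid Basque declensions, so they are kept out of the blacklist.
SAFE_SUFFIXES = ["gatik", "ganako", "rentzako", "rentzat", "rekin", "renganako"]
safe_suffix_re = re.compile("(?:" + "|".join(SAFE_SUFFIXES) + ")$")

with open(BLACKLIST_IN, encoding="utf-8") as src, \
     open(BLACKLIST_OUT, "w", encoding="utf-8") as dst:
    for line in src:
        word = line.strip()
        # Keep a word on the blacklist only if no safe suffix matches it.
        if word and not safe_suffix_re.search(word):
            dst.write(word + "\n")

Risky patterns like *ren, *ri or *ra would stay out of such a suffix list, exactly as described above.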

@mkohler I remember we had or planned for a whitelisting feature right?

What we have is whitelisted symbols:

We do not have anything regarding whitelists for the blacklist, as that is not in our scope. That would be something for https://github.com/dabinat/cvtools.

From my point of view, the issue is just that the scraper doesn’t really fit languages like Basque. I don’t know German, but as far as I know suffixes are common there too, so perhaps German speakers can understand the problem I’m trying to explain, or can explain it to speakers of other languages better than I can. I think I’m not explaining myself clearly enough because the subject is complex and abstract.

I’m a developer, I work in R&D, and I don’t see a way to address the Basque language’s needs programmatically and get a good Wikipedia scraper. I think trying to find hundreds of regular expressions for the blacklist and for the resulting sentence list is a very wrong approach. Many regular expressions would be extremely difficult to define. Many regular expressions would affect only a few sentences, sometimes just one. That’s why I think that trying to apply JUST a programmatic strategy in all cases will discriminate against Basque (and probably against other languages if they try to use the scraper to do part of the work).

If you need to “run the extraction from your side”, the only solution I see is that you execute the scraper with the config included in our pull request and then match the lines of your result against the lines of our result. The lines that are equal on both sides are good sentences, and you can be sure they come from the Basque Wikipedia. This way, we will lose all the hours dedicated to spelling corrections, typos, orthography, grammar, capitalisation, commas, etc., as well as the very occasional substitutions we made of problematic foreign words with Basque ones (I don’t remember any right now, but perhaps something like this: Dostoevsky --> Etxeberria).
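
A minimal sketch of that line-matching step (the file names are made up; in practice it is just a set intersection, which comm on sorted files could also do):

# Keep only the sentences that appear verbatim both in the extraction run done
# on Mozilla's side and in our manually cleaned list. Any manual fix (spelling,
# capitalisation, commas...) makes the line differ, so that sentence is lost.
OFFICIAL_RUN = "official_extraction.txt"       # scraper output with our config
CLEANED_LIST = "basque_cleaned_sentences.txt"  # our manually reviewed sentences
VERIFIED_OUT = "verified_sentences.txt"

with open(OFFICIAL_RUN, encoding="utf-8") as f:
    official = {line.strip() for line in f if line.strip()}

with open(CLEANED_LIST, encoding="utf-8") as f, \
     open(VERIFIED_OUT, "w", encoding="utf-8") as out:
    for line in f:
        sentence = line.strip()
        if sentence in official:
            out.write(sentence + "\n")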

The people behind the Basque sentence compilation (basically me) can’t repeat the work done during the last months, and I see the programmatic approach as a dead-end street, so right now I don’t see any other option to save at least part of the work done and get all the new sentences that Basque voice recordings need.

Michael has just created a chat room on Matrix just for this topic. Feel free to join so we can discuss the options more synchronously and see how we can get the Basque extraction done as soon as possible.

Thanks!

I just ran a new scraping run where I added numbers to the replacements, like:
replacements = [
[" 1 ", " ien "],
[" 2 ", " twa "],
…
]

To be sure it doesn’t replace all numbers, only the loose ones, I added a space before and after each number. This resulted in some 1,000 more sentences on a total of 48,000, and the new sentences all seem OK.

Next up: compare the created blacklist of 250,000 words to the Frisian dictionary from the Frisian spell-checking add-on, to filter out correct words. Any *nix command which can do that?
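
One possible way to do it, sketched in Python (a one-liner like grep -Fvxf frisian_dictionary.txt blacklist.txt should achieve the same; the file names here are just placeholders):

# Drop every blacklist entry that the Frisian dictionary recognises, keeping
# only the genuinely unknown words. Assumes the dictionary is a plain list
# with one word per line.
DICTIONARY = "frisian_dictionary.txt"  # word list from the spell-checking add-on
BLACKLIST = "blacklist.txt"
FILTERED = "blacklist_filtered.txt"

with open(DICTIONARY, encoding="utf-8") as f:
    known = {line.strip().lower() for line in f if line.strip()}

with open(BLACKLIST, encoding="utf-8") as src, \
     open(FILTERED, "w", encoding="utf-8") as dst:
    for line in src:
        word = line.strip()
        if word and word.lower() not in known:
            dst.write(word + "\n")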

Out of curiosity, how far did you go? And how much does that increase the time to run the script?

I went from 1-31, 40, 50-90, 100, 200-900 and 1000. I didn’t time it, but I can’t say it took much more time.
My idea now is to create some sort of whitelist of the most used numbers, the opposite of the word blacklist, and put something like the top 100 in the list of replacements. If someone knows the right grep command to build that whitelist, that would be very helpful.
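
A possible sketch for collecting those most-used numbers (the corpus file name is an assumption; it counts digits that stand alone between whitespace, mirroring the space-delimited replacements):

import re
from collections import Counter

CORPUS = "extracted_sentences.txt"  # hypothetical: the extractor's sentence output

counts = Counter()
standalone_number = re.compile(r"(?<!\S)\d+(?!\S)")  # digits with no adjacent non-space

with open(CORPUS, encoding="utf-8") as f:
    for line in f:
        counts.update(standalone_number.findall(line))

# Print the 100 most frequent loose numbers as candidates for the replacements list.
for number, freq in counts.most_common(100):
    print(freq, number)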

I’m not sure if that’s a good idea. If there is a sentence like “In year 2001…” it will become “In year two zero zero one…”?

That’s why I added the spaces before and after the number, to separate it from larger numbers like years. It works fine that way.
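
A tiny illustration of that, using plain string substitution as a stand-in for the replacements rule (the example sentence is made up):

# Only the space-delimited numbers are replaced; the digits inside "2001"
# have no surrounding spaces, so they are left untouched.
sentence = "In year 2001 there were 2 storms and 1 flood."
out = sentence.replace(" 1 ", " ien ").replace(" 2 ", " twa ")
print(out)  # In year 2001 there were twa storms and ien flood.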


Ah, OK, I got it! Thank you for your quick answer :slight_smile:

I applied the extractor to Chinese wiki pages and it yielded very few sentences. Since Chinese characters and punctuation are quite different from European ones, I would like to know more about the extractor before doing more experiments:

  1. Does the extractor respect sentence separators in Chinese, e.g. “；” is equivalent to a semicolon, “，” is equivalent to a comma, “。” is equivalent to a full stop, etc.? If not, is it possible to solve this by tweaking the rules file? I tried using “replacements” to convert the punctuation (e.g. replacements = [ ["；", "; "] ]) but the number of sentences yielded was the same.

  2. Can the extractor handle multi-byte characters?

May I ask which Chinese wiki pages? For zh-CN a wiki extraction has already been done, so we can’t redo that: https://github.com/mozilla/voice-web/blob/master/server/data/zh-CN/wiki.zh-cn.txt

Maybe, maybe not, I can’t say right now. We’re using the punkt sentence tokenizer, and I don’t know offhand if that supports Chinese punctuation. Given that the (different) extractor that was used for that export is not using punkt, I wouldn’t be surprised if not: https://github.com/Common-Voice/cv-sentence-extractor/blob/mandarin/src/extractor.rs
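
For what it’s worth, here is a quick way to probe punkt’s default behaviour from Python, using NLTK’s punkt implementation as a stand-in for the punkt the extractor uses (so it hints at, but doesn’t prove, what the Rust side does):

from nltk.tokenize.punkt import PunktSentenceTokenizer

# An untrained tokenizer with punkt's default parameters, where only '.', '?'
# and '!' count as sentence-ending characters.
tokenizer = PunktSentenceTokenizer()
text = "今日天氣好。我哋去公園。"

print(tokenizer.tokenize(text))
# Expected: one single "sentence", because the ideographic full stop 。 is not
# a default sentence-end character.

print(tokenizer.tokenize(text.replace("。", ". ")))
# Converting 。 to ". " first (e.g. via a replacements rule) should let punkt
# split the text into two sentences.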

I wouldn’t know of any issue, but of course can’t guarantee it’s bug-free. :slight_smile:

I chose a dialect of Chinese, zh-yue, which has far fewer wiki pages, for the experiment. It corresponds to zh-HK in the CV project. I got zh_yuewiki-20200520-pages-articles-multistream.xml, a 266M file, from Wikipedia. The rules file was basically copied from en.toml, with the parameters tweaked to extract more sentences, e.g. min_trimmed_length = 0, min_word_count = 0, max_word_count = 1000, disallowed_symbols = [], etc. The result has only 247 sentences, and only 1/10 of them are purely Chinese and therefore potentially useful for the CV project.

If zh-CN was successful before, I would be more than happy to look at its rules file for reference.

A quick look at the source code gives me the impression that it does define a set of Chinese punctuation:

static PUNCTUATIONS: [char; 37] = [
    '"', '"', '、', '‧', '—', '—', '—', '~', '“', '”', ';', '·', ':', '‘',
    '•', '─', '兀', '∶', '∧', '∨', ',', '、', '.', ';', ':', '#', '&',
    '*', '+', '-', '<', '>', '=', '$', '%', '@', ',',
];

I notice this branch (mandarin) has had no updates for 10 months. What’s its status? Will it be released for extracting Chinese-family languages (currently zh-CN, zh-TW and zh-HK in the CV project)? Or will its functionality be merged into the master branch?

As it is right now, we can’t integrate that branch as-is. However, it could possibly be used for the other exports (zh-HK and zh-TW), though I have to admit that I was not involved in the previous zh-CN export and have not looked closely at that branch and how well it could be used. @nukeador @irvin do you have thoughts here?

@fiji might have more info, but my understanding is that there was some custom code made just for the zh-CN export to work, which I don’t know if we have on the main branch.

As we discussed in the local Hong Kong CV channel, @hyxibg5lez is thinking of taking advantage of the extractor to get some Cantonese sentences. He has tried it and found it working, with some glitches.

@nukeador I don’t think we will need to merge it into the main branch if it’s still working properly. Can you try it on the zh-HK Wikipedia to see if it can export some sentences for @hyxibg5lez to check?

Edit: there are some glitches, like the sentence breaking points and the result being converted into Simplified Chinese.