[Technical feedback needed] Wikipedia extractor script beta

txopi · March 2, 2020, 6:32pm

I just remembered that after generating the blacklist and before creating the scraper sentence list, we used some regular expressions to shrink the automatic blacklist, because it contained more valid Basque words than invalid ones. The reason, as I explained, is the agglutinative nature of the language, which produces a lot of words with low repetition counts.

So I think a “whitelist” feature would be useful for other languages with properties similar to Basque that are interested in using the Wikipedia scraper: the possibility to define a list of regular expressions that prevent certain kinds of words from being included in the blacklist. For example, many suffixes cause a lot of Basque words to end up in the blacklist, and the same regular expressions I used for the manual cleanup could be included in a configuration file: *gatik, *ganako, *rentzako, *rentzat, *rekin, *renganako… I used many of them, and some I checked manually because they could give false positives: *ren, *ri, *ra… Obviously, these last ones can’t be included in this hypothetical parameter.
If you think this would be interesting for languages other than Basque, I can create an issue in the GitHub project so other people can benefit from it.
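
To make the proposal concrete, here is a minimal Python sketch of such a suffix whitelist. It is not part of the extractor; the names `SAFE_SUFFIXES` and `prune_blacklist` are hypothetical, and the sample words are only illustrations.

```python
import re

# Hypothetical whitelist: the unambiguous suffixes listed in the post.
# The ambiguous ones (*ren, *ri, *ra) are deliberately left out.
SAFE_SUFFIXES = ["gatik", "ganako", "rentzako", "rentzat", "rekin", "renganako"]

# One pattern anchored at the end of the word: matches any safe suffix.
SAFE_SUFFIX_RE = re.compile(r"(?:%s)$" % "|".join(map(re.escape, SAFE_SUFFIXES)))


def prune_blacklist(blacklist):
    """Return the blacklist without words that end in a whitelisted suffix."""
    return [word for word in blacklist if not SAFE_SUFFIX_RE.search(word)]


blacklist = ["lagunarekin", "etxearentzat", "xyzzy", "amaren"]
pruned = prune_blacklist(blacklist)
print(pruned)  # ['xyzzy', 'amaren']
```

A word ending in *ren stays in the blacklist here, so it would still need the manual check described above.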

nukeador · March 2, 2020, 6:34pm

@mkohler I remember we had, or planned for, a whitelisting feature, right?

mkohler · March 2, 2020, 6:37pm

What we have is whitelisted symbols:

We do not have anything regarding whitelists for the blacklist, as that is not in our scope. That would be something for https://github.com/dabinat/cvtools.

txopi · March 2, 2020, 7:25pm

From my point of view, the issue is simply that the scraper doesn’t really fit languages like Basque. I don’t know German, but as far as I know suffixes are common there too, so perhaps German speakers can better understand the problem I’m trying to explain, or can explain it better than I can to speakers of other languages. I think I’m not explaining myself clearly enough because the subject is complex and abstract.

I’m a developer, I work in R&D, and I don’t see a way to address the needs of the Basque language programmatically to get a good Wikipedia scraper. I think trying to find hundreds of regular expressions for the blacklist and for the resulting sentence list is the wrong approach. Many regular expressions would be extremely difficult to define, and many would affect only a few sentences, sometimes just one. That’s why I think that applying JUST a programmatic strategy in all cases will discriminate against the Basque language (and probably other languages, if they try to use the scraper to do part of the work).

If you need to “run the extraction from your side”, the only solution I see is that you run the scraper with the config included in our pull request, and then match the lines of your result against the lines of ours. The lines that are identical on both sides are good sentences, and you can be sure they are in the Basque Wikipedia. This way, we will lose all the hours spent correcting spelling, typos, orthography, grammar, capitalisation, commas, etc., as well as the very occasional substitutions we made of problematic foreign words with Basque ones (I don’t remember any right now, but perhaps something like this: Dostoevsky --> Etxeberria).

The people behind the Basque sentence compilation (basically me) can’t repeat the work done during the last months, and I see the programmatic approach as a dead end, so right now I don’t see any other option to save at least part of that work and get all the new sentences that Basque voice recordings need.
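
The line matching described above could look roughly like this in Python. This is only a sketch under the assumption that both results are plain text with one sentence per line; `common_sentences` is a hypothetical helper and the sample sentences are made up.

```python
def common_sentences(extracted, curated):
    """Return curated sentences that also appear verbatim in the fresh extraction."""
    extracted_set = {line.strip() for line in extracted}
    # Keep the order of the curated list; skip anything that was edited by hand.
    return [line.strip() for line in curated if line.strip() in extracted_set]


# Illustrative data: the second curated sentence was corrected manually,
# so it no longer matches the raw extraction and is dropped.
fresh = ["Kaixo mundua.\n", "Bigarren esaldia.\n"]
ours = ["Kaixo mundua.\n", "Bigarren esaldi zuzendua.\n"]
shared = common_sentences(fresh, ours)
print(shared)  # ['Kaixo mundua.']
```

Any sentence we corrected by hand falls out of the result, which is exactly the loss mentioned above.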

nukeador · March 2, 2020, 7:38pm

Michael has just created a chat room on Matrix just for this topic. Feel free to join so we can discuss options in real time and see how we can get the Basque extraction done as soon as possible.

Thanks!

Fjoerfoks · March 4, 2020, 9:34pm

I just did a new scraping run where I added numbers to the replacements, like:
replacements = [
[" 1 ", " ien "],
[" 2 ", " twa "],
…
]

To make sure it doesn’t replace all numbers, only the standalone ones, I added a space before and after each number. This resulted in 1,000 more sentences on a total of 48,000, and the new sentences all seem OK.

Next up: compare the generated blacklist of 250,000 words with the Frisian dictionary from the Frisian spellchecking add-on to filter out correct words. Is there a *nix command that can do that?
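
One way to do this comparison is sketched below in Python. It assumes both lists hold one word per line, and it also assumes that dictionary entries may carry Hunspell-style flags after a slash (as in `word/S`), which are stripped. `filter_blacklist` is a hypothetical name and the sample words are only illustrations.

```python
def filter_blacklist(blacklist_words, dictionary_words):
    """Drop every blacklist word that the dictionary knows (case-insensitive)."""
    # Assumption: anything after "/" in a dictionary entry is an affix flag.
    known = {entry.strip().split("/")[0].lower() for entry in dictionary_words}
    return [w.strip() for w in blacklist_words if w.strip().lower() not in known]


dictionary = ["hûs/S", "wetter"]
blacklist = ["hûs", "qwrtz", "Wetter"]
remaining = filter_blacklist(blacklist, dictionary)
print(remaining)  # ['qwrtz']
```

If both files are already clean one-word-per-line lists with identical casing, `grep -vxFf dictionary.txt blacklist.txt` should give the same result from the shell (the file names are placeholders).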

mkohler · March 4, 2020, 10:36pm

Out of curiosity, how far did you go? And how much does that increase the time to run the script?

Fjoerfoks · March 5, 2020, 7:21am

I went from 1-31, 40, 50-90, 100, 200-900 and 1000. I didn’t time it, but I can’t say it took much longer.
My idea now is to create some sort of whitelist of the most used numbers, the opposite of the word blacklist, and put something like the top 100 in the list of replacements. If someone knows the correct grep command to fill the whitelist, that would be very helpful.
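
In case a script is acceptable instead of grep, here is a rough Python sketch for collecting the most frequent standalone numbers. It assumes the input is a list of sentences and treats a number as standalone only when it has whitespace (or the start or end of the line) on both sides; `top_numbers` is a hypothetical helper.

```python
import re
from collections import Counter

# Digits with no non-space character directly before or after them.
STANDALONE_NUMBER = re.compile(r"(?<!\S)\d+(?!\S)")


def top_numbers(sentences, limit=100):
    """Return the most frequent standalone numbers, most common first."""
    counts = Counter()
    for sentence in sentences:
        counts.update(STANDALONE_NUMBER.findall(sentence))
    return [number for number, _ in counts.most_common(limit)]


sample = [
    "There are 3 houses and 3 cars.",
    "In 2001 there were 12 people.",
    "Chapter 3.2 is short.",  # "3.2" is not standalone, so it is ignored
]
print(top_numbers(sample, limit=1))  # ['3']
```

Note that a year written with spaces around it counts as standalone too, so the resulting list still needs a look before it goes into the replacements.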

txopi · March 6, 2020, 4:02pm

I’m not sure that’s a good idea. If there is a sentence like “In year 2001…”, will it become “In year two zero zero one…”?

Fjoerfoks · March 6, 2020, 4:34pm

That’s why I added the spaces before and after the number to separate them from larger numbers like years. It works fine that way.
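
A small Python sketch of why the spaces matter. It uses plain string replacement as a stand-in and is not the extractor's actual implementation; `apply_replacements` is a hypothetical helper.

```python
# The two pairs from the rules file above, with a space on each side.
REPLACEMENTS = [(" 1 ", " ien "), (" 2 ", " twa ")]


def apply_replacements(sentence, replacements=REPLACEMENTS):
    """Apply each (old, new) pair in order using plain substring replacement."""
    for old, new in replacements:
        sentence = sentence.replace(old, new)
    return sentence


print(apply_replacements("He has 2 dogs."))           # He has twa dogs.
print(apply_replacements("In year 2001 it started."))  # unchanged
print(apply_replacements("He has 2."))                 # unchanged
```

The last call shows the trade-off: under this stand-in, a number directly followed by punctuation or at the very start or end of a sentence is left alone.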

txopi · March 6, 2020, 4:44pm

Ah, OK, I got it! Thank you for your quick answer

hyxibg5lez · May 28, 2020, 9:36pm

I applied the extractor to Chinese wiki pages and it yielded very few sentences. Since Chinese characters and punctuation are quite different from European ones, I would like to know more about the extractor before doing more experiments:

Does the extractor respect sentence separators in Chinese, e.g. “；” is equivalent to a semi-colon, “，” is equivalent to a comma, “。” is equivalent to a full stop, etc.? If not, is it possible to solve this by tweaking the rules file? I tried using “replacements” to convert the punctuation (e.g. replacements = [ ["；", "; "] ]) but the number of sentences yielded was the same.
Can the extractor handle multi-byte characters?
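
For comparison, this is roughly the splitting behaviour I would expect, sketched in Python. It is independent of the extractor and says nothing about what the extractor does internally; the set of terminators and the name `split_sentences` are assumptions.

```python
import re

# Assumed set of full-width sentence terminators.
TERMINATORS = "。！？；"

# A run of non-terminator characters, optionally followed by one terminator.
SENTENCE_RE = re.compile(r"[^%s]+[%s]?" % (TERMINATORS, TERMINATORS))


def split_sentences(text):
    """Split text after each full-width terminator, keeping the terminator."""
    pieces = (match.group().strip() for match in SENTENCE_RE.finditer(text))
    return [piece for piece in pieces if piece]


text = "今天天氣很好。我們去公園；你呢？"
print(split_sentences(text))
# ['今天天氣很好。', '我們去公園；', '你呢？']

# Each of these characters is one code point but three bytes in UTF-8.
print(len("。"), len("。".encode("utf-8")))  # 1 3
```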

mkohler · May 28, 2020, 9:53pm

May I ask which Chinese wiki pages? For zh-CN a wiki extraction has already been done, so we can’t redo that: https://github.com/mozilla/voice-web/blob/master/server/data/zh-CN/wiki.zh-cn.txt

Maybe, maybe not, I can’t say right now. We’re using the punkt sentence tokenizer, and I do not know offhand whether it supports Chinese punctuation. Given that the (different) extractor that was used for that export is not using punkt, I wouldn’t be surprised if it doesn’t: https://github.com/Common-Voice/cv-sentence-extractor/blob/mandarin/src/extractor.rs

I wouldn’t know of any issue, but of course can’t guarantee it’s bug-free.

hyxibg5lez · May 29, 2020, 1:56am

I chose a dialect of Chinese, zh-yue, which has far fewer wiki pages, for the experiment. It corresponds to zh-HK in the CV project. I got zh_yuewiki-20200520-pages-articles-multistream.xml, a 266M file, from Wikipedia. The rules file was basically copied from en.toml, with the parameters tweaked to extract more sentences, e.g. min_trimmed_length = 0, min_word_count = 0, max_word_count = 1000, disallowed_symbols = [], etc. The result has only 247 sentences, and only 1/10 of them are purely Chinese and therefore potentially useful for the CV project.

If zh-cn was successful before, I would be more than happy to look at the rules file for reference.

A quick look at the source code gives me the impression that it does define a set of Chinese punctuation:

static PUNCTUATIONS: [char; 37] = [
    '"', '"', '、', '‧', '—', '—', '—', '～', '“', '”', '；', '·', '：', '‘',
    '•', '─', '兀', '∶', '∧', '∨', '，', '、', '．', '；', '：', '＃', '＆',
    '＊', '＋', '－', '＜', '＞', '＝', '＄', '％', '＠', '，',
];

I notice this branch (mandarin) has had no updates for 10 months. What’s its status? Will it be released for extracting the Chinese family of languages (currently zh-cn, zh-tw, zh-hk in the CV project)? Or will its functionality be merged into the master branch?

mkohler · May 29, 2020, 3:49pm

As it is right now we can’t integrate that branch as-is. However it could possibly be used for the other exports (zh-HK and zh-TW), though I have to admit that I was not involved in the previous zh-CN export and I have not looked closely at that branch and how well it could be used. @nukeador @irvin do you have thoughts here?

nukeador · May 29, 2020, 5:12pm

@fiji might have more info, but my understanding is that some custom code was written just to make the zh-CN export work, and I don’t know if we have it on the main branch.

irvin · May 30, 2020, 8:02am

As we discussed in the local Hong Kong CV channel, @hyxibg5lez is thinking of taking advantage of the extractor to get some Cantonese sentences. He has tried it and found it working ~~without problems~~ with some glitches.

@nukeador I don’t think we will need to merge it into the main branch if it’s still working properly. ~~can you try it on zh-HK Wikipedia to see if it can export some sentences for @hyxibg5lez to check?~~

Edit: there are some glitches, such as the points where sentences are broken and the result being converted into Simplified Chinese.

irvin · May 29, 2020, 6:34pm

During development (last March), @bobchao and I helped @gregor evaluate the results, but we did not write the rules ourselves.

irvin · May 30, 2020, 8:05am

@nukeador If we can find local Rust people to help improve the extractor from the current zh-cn branch, can you help us run it for zh-hk once we have resolved the issues?

nukeador · June 1, 2020, 12:11pm

If you come up with the rules for zh-HK I can ask to put the extraction into our queue as we have done for other languages. @mkohler what do you think?