[Technical feedback needed] Wikipedia extractor script beta

Michael has just created a chat room on Matrix just for this topic. Feel free to join so we can discuss options more synchronously and see how we can get Basque extraction done as soon as possible.

Thanks!

I just ran a new scraping run where I added numbers to the replacements list, like:

replacements = [
  [" 1 ", " ien "],
  [" 2 ", " twa "],

]

To make sure it doesn't replace all numbers, only the standalone ones, I added a space before and after each number. This resulted in about 1,000 more sentences on a total of 48,000, and the new sentences all seem OK.

Next up: compare the generated blacklist of 250,000 words to the Frisian dictionary from the Frisian spell-checking add-on, to filter out correct words. Any *nix command which can do that?
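For that kind of set difference, `comm` works well. A minimal sketch, assuming both files contain one word per line and are named `blacklist.txt` and `frisian_dict.txt` (both filenames are placeholders):

```shell
# comm requires sorted input; -23 suppresses lines unique to the
# dictionary (-2) and lines common to both files (-3), leaving
# only blacklist words that are NOT in the dictionary.
sort -u blacklist.txt > blacklist.sorted
sort -u frisian_dict.txt > dict.sorted
comm -23 blacklist.sorted dict.sorted > filtered_blacklist.txt
```

Whatever survives in `filtered_blacklist.txt` is the set of words the spellchecker does not recognize.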

Out of curiosity, how far did you go? And how much does that increase the time to run the script?

I went from 1-31, 40, 50-90, 100, 200-900 and 1000. I didn't time it, but it didn't seem to take much more time.
My idea now is to create some sort of whitelist with the most-used numbers, the opposite of the word blacklist, and put the top 100 or so in the list of replacements. If someone knows the right grep command to fill the whitelist, that would be very helpful.
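A frequency count like that can be sketched with grep and uniq. Assuming the extracted sentences live in `sentences.txt` (a placeholder name):

```shell
# -o prints each match on its own line; -w restricts matches to
# whole words, so "2001" is counted as 2001 rather than matching
# the lone digit 1 inside it. The pipeline then counts occurrences
# and keeps the 100 most frequent numbers.
grep -woE '[0-9]+' sentences.txt \
  | sort | uniq -c | sort -rn \
  | head -100
```

Note that `-w` also catches numbers at the start or end of a line or directly before punctuation, which the space-padding trick in the replacements list would miss.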

I’m not sure that’s a good idea. If there is a sentence like “In year 2001…”, won’t it become “In year two zero zero one…”?

That’s why I added the spaces before and after the number to separate them from larger numbers like years. It works fine that way.
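The effect can be illustrated with sed (a hypothetical English sentence, reusing the Frisian “ien” from the replacements list above):

```shell
# " 1 " only matches a lone 1 with a space on both sides, so the
# digit 1 inside "2001" is left untouched.
echo "In 2001 we counted 1 sheep." | sed 's/ 1 / ien /g'
# → In 2001 we counted ien sheep.
```

One caveat of the space trick: a number directly followed by punctuation (“counted 1.”) has no trailing space and is therefore not replaced.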


Ah, OK, I got it! Thank you for your quick answer :slight_smile:

I applied the extractor to Chinese wiki pages and it yielded very few sentences. Since Chinese characters and punctuation are quite different from European ones, I would like to know more about the extractor before doing more experiments:

  1. Does the extractor respect sentence separators in Chinese, e.g. “;” is equivalent to a semicolon, “,” is equivalent to a comma, “。” is equivalent to a full stop, etc.? If not, is it possible to solve this by tweaking the rules file? I tried using “replacements” to convert the punctuation (e.g. replacements = [ [";", "; "] ]) but the number of sentences yielded was the same.

  2. Can the extractor handle multi-byte characters?

May I ask which Chinese wiki pages? For zh-CN a wiki extraction has already been done, so we can’t redo that: https://github.com/mozilla/voice-web/blob/master/server/data/zh-CN/wiki.zh-cn.txt

Maybe, maybe not, I can’t say right now. We’re using the punkt sentence tokenizer, and I don’t know offhand whether it supports Chinese punctuation. Given that the (different) extractor that was used for that export does not use punkt, I wouldn’t be surprised if it doesn’t: https://github.com/Common-Voice/cv-sentence-extractor/blob/mandarin/src/extractor.rs

I wouldn’t know of any issue, but of course can’t guarantee it’s bug-free. :slight_smile:

I chose a dialect of Chinese, zh-yue, which has far fewer wiki pages, for the experiment. It corresponds to zh-HK in the CV project. I got zh_yuewiki-20200520-pages-articles-multistream.xml, a 266 MB file, from Wikipedia. The rules file was basically copied from en.toml, with the parameters tweaked to extract more sentences, e.g. min_trimmed_length = 0, min_word_count = 0, max_word_count = 1000, disallowed_symbols = [], etc. The result has only 247 sentences, and only about 1/10 of them are purely Chinese and thus potentially useful for the CV project.

If zh-cn was successful before, I would be more than happy to look at the rules file for reference.

A quick look at the source code gives me the impression that it does define a set of Chinese punctuation:

static PUNCTUATIONS: [char; 37] = [
    '"', '"', '、', '‧', '—', '—', '—', '~', '“', '”', ';', '·', ':', '‘',
    '•', '─', '兀', '∶', '∧', '∨', ',', '、', '.', ';', ':', '#', '&',
    '*', '+', '-', '<', '>', '=', '$', '%', '@', ',',
];

I notice this branch (mandarin) has had no updates for 10 months. What’s its status? Will it be released for extracting the Chinese family of languages (currently zh-cn, zh-tw, zh-hk in the CV project)? Or will its functionality be merged into the master branch?

Right now we can’t integrate that branch as-is. However, it could possibly be used for the other exports (zh-HK and zh-TW), though I have to admit that I was not involved in the previous zh-CN export and have not looked closely at that branch and how well it could be reused. @nukeador @irvin do you have thoughts here?

@fiji might have more info, but my understanding is that there was some custom code written just to make the zh-CN export work, which I don’t know if we have on the main branch.

As we discussed in the local Hong Kong CV channel, @hyxibg5lez is thinking of taking advantage of the extractor to get some Cantonese sentences. He has tried it and found it works, with some glitches.

@nukeador I don’t think we will need to merge it into the main branch if it’s still working properly. Can you try it on zh-HK Wikipedia to see if it can export some sentences for @hyxibg5lez to check?

Edit: there are some glitches, like where sentences are broken, and the result being converted into Simplified Chinese.

During development (last March) @bobchao and I helped @gregor evaluate the results, but we did not write the rules ourselves.

@nukeador If we can find local Rust people to help improve the extractor from the current zh-cn branch, can you help us run it for zh-hk once we have resolved the issues?

If you come up with the rules for zh-HK I can ask to put the extraction into our queue as we have done for other languages. @mkohler what do you think?

The branch has diverged too far, so we’re thinking of modifying the current mandarin branch directly for zh-hk, with help from the local Rust community.

I agree with that (as already stated in the PR). Looking forward to reviewing the PR once all necessary changes have been made.

We are moving to this topic to gather everything about the sentence extractor in one place.