[Technical feedback needed] Wikipedia extractor script beta

stergro · January 4, 2020, 8:56pm

Hey @Fjoerfoks, the best way to understand it is to look at other rule files. For example, german.toml explains a lot in the comment above abbreviation_patterns (only the third line contains real abbreviations!).

In general, you separate different abbreviations with a pipe and write each abbreviation with escaped full stops. For example:

“a\\.u\\.b\\.|v\\.Chr\\.”

I hope this helps a little.
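
As a rough illustration of the pipe-separated form, here is a minimal Python sketch. Python's `re` module only stands in for whatever regex engine the scraper uses, and the sample sentences and variable names are made up. In a TOML string each backslash is doubled so that the engine ends up seeing `\.`; in a Python raw string a single backslash is enough.

```python
import re

# The regex as the engine sees it once the TOML string
# "a\\.u\\.b\\.|v\\.Chr\\." has been unescaped.
abbreviation_pattern = re.compile(r"a\.u\.b\.|v\.Chr\.")

# Made-up sample sentences, for illustration only.
samples = [
    "It happened 300 v.Chr. in Rome.",
    "Do not forget it a.u.b. tomorrow.",
    "A sentence without any abbreviation.",
]

for sentence in samples:
    found = abbreviation_pattern.search(sentence)
    # Print the abbreviation that was found, or None if there is none.
    print(sentence, "->", found.group(0) if found else None)
```

Because the full stops are escaped, they only match a literal "." and not any character, so a string such as "aXuXbX" is not treated as an abbreviation.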

iLeonidze · January 5, 2020, 2:37pm

Started a new PR for Russian. Please, someone who speaks Russian, review the sample file provided.

I will also make some minor fixes related to the performance of the checks.

Fjoerfoks · January 6, 2020, 8:41am

Ah, perfect! I can work with that.
Are there any limits to the number of characters on 1 line?
Are these case sensitive?

Fjoerfoks · January 17, 2020, 12:48pm

I also raised this on GitHub, but I’d like to ask here too:
Can someone tell me what the status of abbreviations is while scraping sentences?
Are we going to allow them for scraping?

If no, how do we skip these sentences (where a “.” is not the end of a sentence)?
If yes, do we keep them within the sentences ‘as is’ and let the user read out the full form of the abbreviation (e.g. “etc.” is spoken as “etcetera”), or
if yes, are we going to do a search and replace somewhere during the scraping process?

I would really like to know before proceeding with the next run.

Fjoerfoks · January 12, 2020, 3:44pm

Last time I ran the Dutch scraping it seemed like it would take around 2 days to finish on my laptop with 6 GB RAM, AMD A8-6410.
This made me think: is it an option for Mozilla to fire up a virtual machine with all the prerequisites needed for scraping, and with lots of memory to speed up this process? Localizers would only need SSH access, e.g. through PuTTY.
Just a thought…

mkohler · January 12, 2020, 4:08pm

I’m aware that tokenization can be pretty bad. This is how I solved it for German: https://github.com/Common-Voice/common-voice-wiki-scraper/blob/master/src/rules/german.toml#L22. If you have any good ideas for this, please add them to https://github.com/Common-Voice/common-voice-wiki-scraper/issues/11.

It’s hard to get people to do that, so let’s avoid anything like that as far as possible.

Let’s give our best to avoid that.

IMHO it’s fine if you stop it after some time, I don’t think we need the full output for verification before we merge and run it officially.

Given the above, I don’t think this is needed. However there is https://github.com/Common-Voice/common-voice-wiki-scraper/issues/18, but I doubt we can do it for the full export to make sure we can provide timely feedback. However, for smaller chunks we probably can do it and that should IMHO be enough for feedback for validation.

I’ll put this on my list again, will be done after https://github.com/Common-Voice/common-voice-wiki-scraper/issues/9 and https://github.com/Common-Voice/common-voice-wiki-scraper/issues/14.

Fjoerfoks · February 11, 2020, 10:32am

I think @nukeador raised an issue about how to deal with the large numbers of sentences and how to get these validated.
After using CV for a while, both donating and validating, I think one option is to let the community validate the sentences while speaking. If a user detects a sentence that is not valid, there should be an option to reject it. After 2 rejections, it would be moved to the “graveyard” for an update or total deletion.
What do you think?
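
To make the proposal concrete, here is a small Python sketch of the "two rejections, then graveyard" idea. It is not how Common Voice works; the class name `SentencePool`, the method `reject`, and the constant `REJECT_THRESHOLD` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical threshold taken from the proposal: 2 rejections.
REJECT_THRESHOLD = 2


class SentencePool:
    """Tracks rejections per sentence and retires sentences that hit the threshold."""

    def __init__(self, sentences):
        self.active = set(sentences)
        self.graveyard = set()
        self._rejects = defaultdict(int)

    def reject(self, sentence):
        # Ignore sentences that are unknown or already in the graveyard.
        if sentence not in self.active:
            return
        self._rejects[sentence] += 1
        if self._rejects[sentence] >= REJECT_THRESHOLD:
            self.active.remove(sentence)
            self.graveyard.add(sentence)


pool = SentencePool(["A valid sentence.", "A brokn sentnce"])
pool.reject("A brokn sentnce")
pool.reject("A brokn sentnce")
print(sorted(pool.graveyard))
```

Sentences in the graveyard could then be fixed or deleted by a reviewer, as suggested above.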

nukeador · February 11, 2020, 12:42pm

Right now we have the report feature, which is trying to solve the same problem. We just need to create a process to review reports and apply removals when needed.

Fjoerfoks · February 27, 2020, 9:46pm

I have this regex in my frisian.toml:
“[A-Z]+\.*[A-Z]”,
but still see names appear in the sentences like “A. H. Draper”.
How can I eliminate these?

dabinat · February 28, 2020, 12:35am

Try:

[A-Z]\.\ [A-Z][a-z]+
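
A quick way to compare the two expressions is to try them on the problematic name. The sketch below uses Python's `re` module, which is only a stand-in for the scraper's regex engine, and it assumes that a pattern counts as a hit when it matches anywhere in the sentence.

```python
import re

# The expression from frisian.toml and the suggested replacement.
first_attempt = re.compile(r"[A-Z]+\.*[A-Z]")
suggestion = re.compile(r"[A-Z]\.\ [A-Z][a-z]+")

text = "A. H. Draper"

# The first expression needs a capital letter directly after the optional
# dots, but here every dot is followed by a space, so nothing matches.
print(first_attempt.search(text))

# The suggestion allows "capital, dot, space, capitalised word".
match = suggestion.search(text)
print(match.group(0) if match else None)  # H. Draper
```

The first expression does match strings such as "A.H" or "NATO", where the capitals are adjacent or separated only by dots.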

comodoro · February 28, 2020, 8:35pm

The Czech pull request (https://github.com/Common-Voice/common-voice-wiki-scraper/pull/90) is kind of stuck. Could some Czech speakers and/or repository maintainers please comment? By the way, the rules are quite restrictive to minimize errors (quite a few arise from tokenization). I think much more could be harvested, at the cost of a higher error rate.

Fjoerfoks · March 1, 2020, 9:21am

Well, that doesn’t work, although the expression is right. After running some tests, I finally decided to skip all abbreviations with:
abbreviation_patterns = [
“\. +[a-zA-Z|û]”,
“\.+[a-zA-Z|û]”,
]
It really speeds up the process and doesn’t have any effect on the total number of sentences.
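
For readers wondering what these two patterns catch, here is a hedged Python sketch. It assumes a sentence is skipped when any pattern matches somewhere inside it; the helper name `looks_abbreviated` and the test sentences are made up, and Python's `re` module only approximates the scraper's regex engine.

```python
import re

# The two patterns from the post, written as Python raw strings.
patterns = [
    re.compile(r"\. +[a-zA-Z|û]"),  # full stop, one or more spaces, then a letter
    re.compile(r"\.+[a-zA-Z|û]"),   # one or more full stops directly before a letter
]


def looks_abbreviated(sentence: str) -> bool:
    """Return True if any pattern matches anywhere in the sentence."""
    return any(p.search(sentence) for p in patterns)


for sentence in [
    "He met A. H. Draper there.",   # full stop followed by a space and a letter
    "See e.g.this example.",        # full stop directly followed by a letter
    "A plain sentence.",            # only a final full stop
]:
    print(sentence, "->", looks_abbreviated(sentence))
```

Note that inside a character class the `|` is a literal pipe character, not an alternation, so `[a-zA-Z|û]` means "an ASCII letter, a pipe, or û".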

Fjoerfoks · March 1, 2020, 9:20am

Where have the numbers gone?
After reviewing the result, I came to the conclusion that there are no numbers written as digits (like 1, 2, 300, etc.) in the sentences at all. They are in the wiki.[locale].all.txt, but not in the final result.
I don’t think this is right; digits should be included so that in the future we can order 3 pizzas.

txopi · March 1, 2020, 10:27am

We finally prepared a new Basque sentence set thanks to the scraper. It contains 55,031 new sentences.

To create the blacklist we used a criterion of fewer than 20 repetitions. Basque is an agglutinative language and uses a lot of suffixes (many similar words with a low repetition rate), so higher thresholds just reduced the size of the result without improving its quality. We got 110,000 de-duplicated sentences, then reduced the set using many regular expressions (foreign words, wrong characters/words…) and finally cleaned and fixed it manually (spelling errors, agreement, more foreign words…). Five people did this work. Here is the pull request for Basque in the scraper project: https://github.com/Common-Voice/common-voice-wiki-scraper/pull/95
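
The repetition criterion can be sketched as a simple word count. This is only an illustration of the idea in Python, not the project's actual blacklist tooling; the function name `build_blacklist`, the tokenisation with `\w+`, and the tiny example corpus are assumptions.

```python
import re
from collections import Counter


def build_blacklist(sentences, min_repetitions=20):
    """Return words that occur fewer than `min_repetitions` times in the corpus."""
    counts = Counter(
        word.lower()
        for sentence in sentences
        for word in re.findall(r"\w+", sentence)
    )
    return sorted(word for word, n in counts.items() if n < min_repetitions)


# Tiny made-up corpus; a threshold of 3 is used so the effect is visible.
corpus = ["etxea handia da", "etxea txikia da", "etxea da"]
print(build_blacklist(corpus, min_repetitions=3))  # ['handia', 'txikia']
```

In a language with many inflected forms, each form is counted separately, which is why a large share of valid words can fall below the threshold.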

Finally, a sixth person (a Basque language teacher) reviewed 550 random sentences from that result and found errors in 2% of them. Some of these, such as missing commas, are not very important for the aim of this project.

Here is the result: https://librezale.eus/mediawiki//images/2/2b/CommonVoicerako-esaldiak3-Wikipedia.txt

What is the next step we should take? Can you load these sentences without making us validate them again 5 by 5 in the Sentence Validator (we already did that work carefully!)?

mkohler · March 1, 2020, 11:32am

I’m not sure I understand this. How many sentences do you get just by running the export with your current rule set? These regular expressions are part of https://github.com/Common-Voice/common-voice-wiki-scraper/pull/95/files#diff-110f3d80ffac0f64d1a7d81935e710bbR28, correct?

What is the error rate without any manual work applied?

txopi · March 1, 2020, 8:18pm

OK, I’ll try to explain myself better.

Using the scraper with the blacklist and some regular expressions for abbreviations and a few more words, we soon saw that we couldn’t achieve an acceptable result for Basque this way. After filtering strange characters, etc., about 50% of the sentences were problematic, almost regardless of which repetition criterion we chose. That’s because Basque Wikipedia isn’t as big as other ones, so many articles are quite short and full of names of people and places, and because the Basque language itself is made up of many long words with low repetition rates: verb declension, re-declension, compound words, prefixes, a pile of suffixes… The problem was that many foreign words in the sentences didn’t follow Basque language rules and would ruin the phonetic models, so we needed to remove them (we kept the sentences whose foreign words are compatible with Basque).

We needed to divide the work between the volunteers, and most of them didn’t know regular expressions, so I took a text editor and did a first clean-up with about 200-300 regular-expression searches for words, or parts of words, that occur in problematic foreign words. Remember that Basque uses a lot of prefixes and suffixes, so creating perfect regular expressions for all the cases would be a nightmare. This way, I just decided on the go whether each sentence should be removed, and many potential regular expressions became unnecessary as the wrong sentences became fewer and fewer. Thanks to this iterative process, I removed many sentences in a short time (but I didn’t end up with a list of perfect regular expressions that can be reused).

Then other volunteers continued the work, fixing sentences with spelling problems, removing or changing sentences with problematic foreign words, etc. Finally, from a collection of 110,000 de-duplicated sentences given by the scraper, we got 55,000 validated sentences. Did I explain myself now? Do you understand how, after a horrible beginning (about 50% wrong), we got quite a good ending (about 1-2% wrong)?

I don’t know if the steps we took were the best, the optimal, etc. I just know that Basque Wikipedia is quite small and we got many new sentences for the Common Voice project.

mkohler · March 1, 2020, 9:22pm

Thanks, now I got it.

nukeador · March 2, 2020, 3:37pm

I suspect this manual filtering after the extraction might surface some issues with our current process.

We need to be able to run the extraction from our side (using blacklist and rules), and then you can ask for removals or fixes to the final list of sentences, but not add any new sentences.

This is the only way we have to make sure that our legal constraints are enforced by the scripts.

Can you come up with a proposal with these constraints in mind so we can move forward?

Thanks!
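
The "removals only, no additions" constraint described above could be checked mechanically. The following Python sketch is one possible way to do it and is not part of the scraper; the function name `check_reviewed_list` and the sample data are hypothetical. A sentence that was fixed by hand would show up as an addition in this strict comparison and would need separate handling.

```python
def check_reviewed_list(extracted, reviewed):
    """Compare a manually reviewed list against the official extraction.

    Returns (added, removed): sentences that appear only in the reviewed
    list, and sentences that were dropped during review.
    """
    extracted_set = set(extracted)
    reviewed_set = set(reviewed)
    added = [s for s in reviewed if s not in extracted_set]
    removed = [s for s in extracted if s not in reviewed_set]
    return added, removed


# Made-up data for illustration.
extracted = ["Sentence one.", "Sentence two.", "Sentence three."]
reviewed = ["Sentence one.", "Sentence three.", "A brand new sentence."]

added, removed = check_reviewed_list(extracted, reviewed)
print("Not in the extraction:", added)
print("Removed during review:", removed)
```

An empty `added` list would mean the reviewed file contains only sentences that the scripts produced.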

txopi · March 2, 2020, 6:32pm

I just remembered that after generating the blacklist and before creating the scraper sentence list, we used some regular expressions to reduce the automatic blacklist, because it contained more valid Basque words than invalid ones. The reason was, as I explained, the agglutinative nature of the language, which produces a lot of words with low repetition counts.

So I think a “whitelist” feature would be useful for other languages with similar properties to Basque that are interested in using the Wikipedia scraper: the possibility to define a list of regular expressions that keep certain kinds of words out of the blacklist. For example, many suffixes cause a lot of Basque words to end up in the blacklist, and the same regular expressions I used for the manual clean-up could be included in a configuration file: *gatik, *ganako, *rentzako, *rentzat, *rekin, *renganako… I used a lot of them, and some of them I checked manually because they could give false positives: *ren, *ri, *ra… Obviously, these last ones can’t be included in this hypothetical parameter.
If you find this interesting for languages other than Basque, I can create an issue in the GitHub project so other people can benefit from it.
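
As a sketch of what such a whitelist could look like, the Python snippet below removes words with the listed suffixes from a blacklist. The scraper has no such option as far as this thread shows; the names `SUFFIXES`, `whitelist` and `filter_blacklist` are hypothetical, and the example words are only for illustration.

```python
import re

# Suffixes taken from the post; each one must be preceded by at least one character.
SUFFIXES = ["gatik", "ganako", "rentzako", "rentzat", "rekin", "renganako"]
whitelist = re.compile(r"\w+(?:" + "|".join(SUFFIXES) + r")")


def filter_blacklist(blacklist):
    """Drop blacklist entries that end in one of the whitelisted suffixes."""
    return [word for word in blacklist if not whitelist.fullmatch(word)]


# Made-up blacklist: two inflected Basque-looking words and two other tokens.
blacklist = ["lagunarekin", "Draper", "amarentzat", "xyzzy"]
print(filter_blacklist(blacklist))  # ['Draper', 'xyzzy']
```

Short, ambiguous suffixes such as *ren, *ri or *ra are left out on purpose, since the post notes that they can produce false positives.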

nukeador · March 2, 2020, 6:34pm

@mkohler, I remember we had, or planned, a whitelisting feature, right?