[Technical feedback needed] Wikipedia extractor script beta

Try:

[A-Z]\.\ [A-Z][a-z]+

The Czech pull request (https://github.com/Common-Voice/common-voice-wiki-scraper/pull/90) is kind of stuck. Could some Czech speakers and/or repository maintainers please comment? By the way, the rules are quite restrictive in order to minimize errors (quite a few of which arise from tokenization). I think much more could be harvested, at the cost of a higher error rate.

Well that doesn’t work although the expression is right. After running some tests, I finally decided to skip all abbreviations with:
abbreviation_patterns = [
"\. +[a-zA-Z|û]",
"\.+[a-zA-Z|û]",
]
It really speeds up the process and doesn’t have any effect on the total number of sentences.
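
For anyone who wants to check what these two patterns actually reject, here is a minimal Python sketch. It assumes the scraper drops a sentence when a pattern matches anywhere in it (search semantics). The helper name `looks_like_abbreviation` and the sample sentences are made up for illustration and are not part of the scraper.

```python
import re

# The two patterns from the post above, written as Python raw strings.
ABBREVIATION_PATTERNS = [
    r"\. +[a-zA-Z|û]",  # a period, one or more spaces, then a letter
    r"\.+[a-zA-Z|û]",   # one or more periods directly followed by a letter
]
COMPILED = [re.compile(p) for p in ABBREVIATION_PATTERNS]


def looks_like_abbreviation(sentence):
    """Return True if any pattern matches somewhere in the sentence.

    Assumption: a match anywhere means the sentence is skipped.
    """
    return any(p.search(sentence) for p in COMPILED)


if __name__ == "__main__":
    for s in ["It was written by J. Smith in 1999", "Sjoch bgl.dit", "Dit is in sin."]:
        print(looks_like_abbreviation(s), s)
```

Note that inside a character class `|` is a literal pipe character rather than alternation, so `[a-zA-Z|û]` also matches `|`.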

Where have the numbers gone?
After reviewing the result, I came to the conclusion that the sentences contain no numbers written as digits (1, 2, 300, etc.) at all. They are present in wiki.[locale].all.txt, but not in the final result.
I don’t think this is right. Digits should be included so that in the future we can order 3 pizzas.

We finally prepared a new Basque sentence set thanks to the scraper. It contains 55,031 new sentences.

To create the blacklist we used a threshold of fewer than 20 repetitions. Basque is an agglutinative language that uses a lot of suffixes (many similar words with a low repetition rate), so higher thresholds just reduced the size of the result without improving its quality. We got 110,000 de-duplicated sentences, then reduced that set using many regular expressions (foreign words, wrong characters/words…) and finally cleaned and fixed it manually (spelling errors, agreement, more foreign words…). Five people did this work. Here is the pull request for Basque in the scraper project: https://github.com/Common-Voice/common-voice-wiki-scraper/pull/95
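
To make the repetition criterion concrete, here is a rough Python sketch of a frequency-based blacklist: every word seen fewer than `min_count` times goes on the list. This is only an illustration of the idea, not the tool we actually used. The function name `build_blacklist`, the letter-only tokenization and the lowercasing are all assumptions.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of Unicode letters (no digits, no underscore).
WORD_RE = re.compile(r"[^\W\d_]+")


def build_blacklist(sentences, min_count=20):
    """Return the sorted list of words that occur fewer than min_count times."""
    counts = Counter(
        word.lower() for sentence in sentences for word in WORD_RE.findall(sentence)
    )
    return sorted(word for word, count in counts.items() if count < min_count)


if __name__ == "__main__":
    sample = ["Etxea handia da.", "Etxea txikia da.", "Etxea da."]
    # With a toy threshold of 2, only the words seen once are blacklisted.
    print(build_blacklist(sample, min_count=2))
```

With a heavily suffixed language, many perfectly valid word forms fall under the threshold, which is why raising it mostly shrinks the output.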

Finally, a sixth person (a Basque language teacher) reviewed 550 random sentences from that result and found errors in 2% of them. Some of these, such as missing commas, are not very important for the aim of this project.

Here is the result: https://librezale.eus/mediawiki//images/2/2b/CommonVoicerako-esaldiak3-Wikipedia.txt

What is the next step we should take? Can you load these sentences without making us validate them again, 5 at a time, in the Sentence Validator (we already did that work carefully!)?

I’m not sure I understand this. How many sentences do you get just by running the export with your current rule set? These regular expressions are part of https://github.com/Common-Voice/common-voice-wiki-scraper/pull/95/files#diff-110f3d80ffac0f64d1a7d81935e710bbR28, correct?

What is the error rate without any manual work applied?

OK, I’ll try to explain myself better :)

Using the scraper with the blacklist and some regular expressions for abbreviations and a few more words, we soon saw that we couldn’t achieve an acceptable result for Basque this way. After filtering strange characters, etc., about 50% of the sentences were problematic, almost regardless of which repetition threshold we chose. That’s because Basque Wikipedia isn’t as big as other ones, so many articles are quite short and full of names of people and places, and because Basque itself is made up of many long words with low repetition rates: verb declension, re-declension, compound words, prefixes, a pile of suffixes… The problem was that many foreign words in the sentences didn’t follow Basque language rules and would ruin the phonetic models, so we needed to remove them (we kept the sentences whose foreign words are compatible with Basque).

We needed to divide the work among the volunteers, and most of them didn’t know regular expressions, so I took a text editor and did a first cleaning pass with about 200-300 regular-expression searches for words, or parts of words, that occur in problematic foreign words. Remember that Basque uses a lot of prefixes and suffixes, so creating perfect regular expressions for all the cases would be a nightmare. Instead, I decided on the go whether each matching sentence should be removed, and many potential regular expressions became unnecessary as fewer and fewer wrong sentences remained. Thanks to this iterative process, I removed many sentences in a short time (but I didn’t end up with a list of perfect regular expressions that can be reused).

Then other volunteers continued the work: fixing sentences with spelling problems, removing or changing sentences with problematic foreign words, etc. Finally, from a collection of 110,000 de-duplicated sentences produced by the scraper, we got 55,000 validated sentences. Have I explained myself better now? Do you see how, after a horrible beginning (about 50% wrong), we got quite a good ending (about 1-2% wrong)?

I don’t know whether the steps we took were the best or the optimal ones. I just know that Basque Wikipedia is quite small and we got many new sentences for the Common Voice project.

Thanks, now I got it.

I suspect this manual filtering after the extraction might surface some issues with our current process.

We need to be able to run the extraction from our side (using the blacklist and rules). You can then ask for removals or fixes to the final list of sentences, but not add any new sentences.

This is the only way we have to make sure that our legal constraints are enforced by the scripts.

Can you come up with a proposal with these constraints in mind so we can move forward?

Thanks!

I have remembered that after generating the blacklist and before creating the scraper sentence list, we used some regular expressions just to reduce the automatic blacklist, because it contained more valid Basque words than invalid ones. The reason was, as I explained, the agglutinative nature of the language, which produces a lot of words with low repetition counts.

So I think a “whitelist” feature would be useful for other languages with properties similar to Basque that are interested in using the Wikipedia scraper: the possibility to define a list of regular expressions that prevent certain kinds of words from being included in the blacklist. For example, many suffixes cause a lot of Basque words to end up in the blacklist, and the same regular expressions I used for the manual cleaning could be included in a configuration file: *gatik, *ganako, *rentzako, *rentzat, *rekin, *renganako… I used a lot of them, and some I checked manually because they could give false positives: *ren, *ri, *ra… Obviously, the latter can’t be included in this hypothetical parameter.
If you find this interesting for languages other than Basque, I can create an issue in the GitHub project so other people can benefit from it.
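
To show what I mean by that hypothetical parameter, here is a small Python sketch. The scraper has no such option today, so the names `SAFE_SUFFIXES` and `prune_blacklist` are invented for illustration, and the sample words are only examples. The idea is that any blacklisted word that is a stem plus one of the safe suffixes gets taken off the blacklist.

```python
import re

# The "safe" suffixes from the post; the ambiguous ones (*ren, *ri, *ra)
# are deliberately left out because they can give false positives.
SAFE_SUFFIXES = ["gatik", "ganako", "rentzako", "rentzat", "rekin", "renganako"]


def compile_suffix_whitelist(suffixes):
    """Build one regex that matches <stem><suffix> for any listed suffix."""
    # Longest suffixes first so the alternation reads naturally.
    ordered = sorted(suffixes, key=len, reverse=True)
    alternatives = "|".join(re.escape(s) for s in ordered)
    # \w+ requires a non-empty stem before the suffix.
    return re.compile(rf"\w+(?:{alternatives})")


def prune_blacklist(blacklist, suffixes=SAFE_SUFFIXES):
    """Drop blacklist entries that consist of a stem plus a safe suffix."""
    pattern = compile_suffix_whitelist(suffixes)
    return [word for word in blacklist if not pattern.fullmatch(word)]


if __name__ == "__main__":
    print(prune_blacklist(["lagunarekin", "dostoevsky", "amarentzat"]))
```

A foreign word that happens to end in one of these suffixes would slip through, so the suffix list would have to stay conservative.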

@mkohler I remember we had, or planned for, a whitelisting feature, right?

What we have is whitelisted symbols:

We do not have anything regarding whitelists for the blacklist, as that is not in our scope. That would be something for https://github.com/dabinat/cvtools.

From my point of view, the issue is just that the scraper doesn’t really fit languages like Basque. I don’t know German, but as far as I know suffixes are common there too, so perhaps German speakers can understand the problem I’m trying to explain, or can explain it to speakers of other languages better than I can. I think I’m not explaining myself clearly enough because the subject is complex and abstract.

I’m a developer, I work in R&D, and I don’t see a way to address the needs of the Basque language programmatically to get a good Wikipedia scraper. I think trying to find hundreds of regular expressions for the blacklist and for the resulting sentence list is a very wrong approach. Many regular expressions would be extremely difficult to define. Many would affect only a few sentences, sometimes just one. That’s why I think that applying JUST a programmatic strategy in all cases will discriminate against the Basque language (and probably other languages, if they try to use the scraper to do part of the work).

If you need to “run the extraction from your side”, the only solution I see is that you execute the scraper with the config included in our pull request and then match the lines of your result against the lines of ours. The lines that are identical on both sides are good sentences, and you can be sure they come from Basque Wikipedia. This way, we will lose all the hours spent on fixing spelling, typos, grammar, capitalisation, commas, etc., as well as the very occasional substitutions we made of problematic foreign words with Basque ones (I don’t remember any right now, but perhaps something like this: Dostoevsky --> Etxeberria).
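
The matching step could be as simple as the Python sketch below. It assumes one sentence per line in both results and an exact match after trimming surrounding whitespace. The function name `common_sentences` is made up for the example.

```python
def common_sentences(scraper_lines, curated_lines):
    """Keep the scraper sentences that also appear, unchanged, in the curated list.

    The output follows the order of the scraper result and contains no duplicates.
    """
    curated = {line.strip() for line in curated_lines}
    seen = set()
    result = []
    for line in scraper_lines:
        sentence = line.strip()
        if sentence and sentence in curated and sentence not in seen:
            seen.add(sentence)
            result.append(sentence)
    return result


if __name__ == "__main__":
    scraper = ["Kaixo mundua.\n", "Dostoevsky idazlea zen.\n", "Egun on.\n"]
    curated = ["Egun on.\n", "Kaixo mundua.\n", "Etxeberria idazlea zen.\n"]
    print(common_sentences(scraper, curated))
```

Any sentence we corrected by hand no longer matches its original and is dropped, which is exactly the loss described above.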

The people behind the Basque sentence compilation (basically me) can’t repeat the work done during the last months, and I see the programmatic approach as a dead end, so right now I don’t see any other option to save at least part of that work and get all the new sentences that Basque voice recordings need.

Michael has just created a chat room on Matrix just for this topic. Feel free to join so we can discuss options synchronously and see how we can get the Basque extraction in as soon as possible.

Thanks!

I just did a new scraping run in which I added numbers to the replacements, like this:
replacements = [
[" 1 ", " ien "],
[" 2 ", " twa "],

]

To be sure it replaces only the standalone numbers and not every digit, I added a space before and after each number. This resulted in about 1,000 more sentences on a total of 48,000, and the new sentences all seem OK.

Next up: compare the generated blacklist of 250,000 words against the Frisian dictionary from the Frisian spell-checking add-on, to filter out correct words. Is there any *nix command that can do that?
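
Not a one-line shell command, but the comparison can be sketched in a few lines of Python. This assumes the add-on ships a Hunspell-style .dic file (an optional word-count line first, then one entry per line, optionally followed by `/FLAGS`); I have not checked the Frisian add-on itself, so treat that as an assumption. The function names are made up for the example.

```python
def dictionary_words(dic_lines):
    """Collect lowercased entries from Hunspell-style dictionary lines."""
    words = set()
    for line in dic_lines:
        # Drop affix flags such as "hûs/AB" -> "hûs".
        entry = line.strip().split("/", 1)[0]
        # Skip blank lines and the numeric word-count header.
        if entry and not entry.isdigit():
            words.add(entry.lower())
    return words


def filter_blacklist(blacklist_lines, dic_lines):
    """Return the blacklist entries that are NOT known dictionary words."""
    known = dictionary_words(dic_lines)
    stripped = (line.strip() for line in blacklist_lines)
    return [word for word in stripped if word and word.lower() not in known]


if __name__ == "__main__":
    blacklist = ["hûs\n", "xyzzy\n", "Beam\n"]
    dictionary = ["3\n", "hûs/AB\n", "beam\n", "wetter\n"]
    print(filter_blacklist(blacklist, dictionary))
```

This only catches base forms listed in the dictionary. Inflected forms that Hunspell would derive through its affix rules are not expanded here, so some correct words would stay on the blacklist.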

Out of curiosity, how far did you go? And how much does that increase the time to run the script?

I covered 1-31, 40, 50-90, 100, 200-900 and 1000. I didn’t time it, but I can’t say it took much more time.
My idea now is to create some sort of whitelist of the most-used numbers, the opposite of the word blacklist, and put something like the top 100 in the list of replacements. If someone knows the right grep command to fill the whitelist, that would be very helpful.
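
In case grep turns out to be awkward for this, here is a Python sketch of the counting step. It only counts numbers that have a space on both sides, to stay consistent with the space-padded replacements. The function name `count_standalone_numbers` is made up, and the sample lines are just illustrations.

```python
import re
from collections import Counter

# A run of digits with a space directly before and after it.
# Lookarounds are used so that neighbouring numbers can share a space.
STANDALONE_NUMBER = re.compile(r"(?<= )\d+(?= )")


def count_standalone_numbers(lines):
    """Count how often each space-delimited number occurs in the given lines."""
    counts = Counter()
    for line in lines:
        counts.update(STANDALONE_NUMBER.findall(line))
    return counts


if __name__ == "__main__":
    sample = [
        "Der wiene 2 hûnen en 3 katten .",
        "Yn 2001 wiene der 2 fûgels , net 25.",
    ]
    counts = count_standalone_numbers(sample)
    # most_common(100) would give the candidates for the replacement list.
    print(counts.most_common(100))
```

Numbers at the very start of a line or directly before punctuation are not counted, just as the space-padded replacements would not touch them.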

I’m not sure if that’s a good idea. If there is a sentence like “In year 2001…” it will become “In year two zero zero one…”?

That’s why I added the spaces before and after the number to separate them from larger numbers like years. It works fine that way.
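
A quick way to convince yourself is the Python sketch below. It assumes the scraper applies each replacement as a plain literal substring substitution, in order; the function name `apply_replacements` and the sample sentence are invented for the example.

```python
# Space-padded pairs, mirroring the first two entries of the replacements list.
REPLACEMENTS = [(" 1 ", " ien "), (" 2 ", " twa ")]


def apply_replacements(sentence, replacements=REPLACEMENTS):
    """Apply each (old, new) pair as a literal substring replacement, in order."""
    for old, new in replacements:
        sentence = sentence.replace(old, new)
    return sentence


if __name__ == "__main__":
    # The year stays untouched because "2" inside "2001" has no spaces around it.
    print(apply_replacements("Yn 2001 kamen 2 boeken út."))
```

The flip side is that a number at the very start of a sentence, or directly before a comma or period, is left as a digit too.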

Ah, OK, I got it! Thank you for your quick answer :)