[Technical feedback needed] Wikipedia extractor script beta


Important: Please go to this topic for the most up-to-date information about the sentence extractor and how to use it.


Hello everyone,

As we announced a few months ago, in an effort to get sentences from large data sources, we found a legal way to extract some Wikipedia sentences for our corpus under the Public Domain, so we can have enough volume to cover the minimum of 2,000 hours of voice we want to collect to create a basic model for speech recognition.

Today I want to share the script we have been developing, so the more technical people can take an early look, play with it and help us improve it. The goal is to automate it so that, in the future, non-technical people can engage and create per-language rules without any technical knowledge.


Important: Please note we won’t be accepting any pull requests with sentences created using this script.

Don’t use the Sentence Collector to send Wikipedia sentences; we must follow the process described here for legal reasons.


Skills you will need to test this script:

  • Basic Git
  • Basics of installing Rust and running Python scripts
  • A decent machine (the scripts are CPU-heavy)
  • Basic bash to check the generated files.

The main things we want technical contributors to test are:

  • Download the code.
  • Download the Wikipedia corpus for your language.
  • Create a rules file for your language.
  • Create a blacklist based on less common words.
  • Run the extraction using these rules and blacklist (a rough command-line sketch follows this list).
  • Report back here on your experience, what to improve, and the results for your language (how many strings you got, what the overall quality is).
  • File GitHub issues for any bugs.
  • Share any ideas on how to automate the process so non-technical people can generate rules and blacklists.
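For orientation, here is a rough sketch of what an end-to-end test run looks like. Treat it as an illustration only: the repository URL, subcommand and flags can change between versions, so the README in the scraper repo is the source of truth, and the values below are just examples.

```bash
# Example end-to-end run (Esperanto, language code "eo").
# Check the scraper README for the exact subcommands and flags;
# the ones below are only illustrative.

LANGUAGE=eo

# 1. Get the scraper code (needs Rust/cargo to build and run).
git clone https://github.com/Common-Voice/common-voice-wiki-scraper.git
cd common-voice-wiki-scraper

# 2. Download and unpack the Wikipedia dump for your language.
wget "https://dumps.wikimedia.org/${LANGUAGE}wiki/latest/${LANGUAGE}wiki-latest-pages-articles-multistream.xml.bz2"
bzip2 -d "${LANGUAGE}wiki-latest-pages-articles-multistream.xml.bz2"

# 3. Convert the dump to plain text with WikiExtractor (a separate Python tool).
python WikiExtractor.py --json "${LANGUAGE}wiki-latest-pages-articles-multistream.xml"

# 4. Run the extraction with your per-language rules file and blacklist
#    (rules/<language>.toml in the checkout I tested).
cargo run -- extract -l "$LANGUAGE" -d text/ > "wiki.${LANGUAGE}.txt"
```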

Again, we won’t be using the strings you generated directly; first we want to make sure the code works for other languages (it did for English, French, German and Spanish) and to see the rules and blacklist files you generated.

Thanks!

Update: We have a chat room on Matrix just for this topic. Feel free to join.

Hello,

I tried the script (using corrected rules and generated blacklist files) and got about 100,000 sentences from the Georgian Wikipedia (which has about 130,000 articles). But there are duplicated and incomplete ones as well.

This is great, thanks for testing.

  • How many duplicates did you find? (We are working on integrating a duplicates filter; a quick way to count them is sketched below.)
  • Did you have the chance to check a couple of small samples (100-500 sentences) with other Georgian native speakers to evaluate the error rate?
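If it helps, counting exact duplicates needs nothing more than basic bash; wiki.ka.txt here is just a placeholder for whatever your output file is called:

```bash
# Number of distinct sentences that appear more than once in the output.
sort wiki.ka.txt | uniq -d | wc -l
```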

Thanks!

There were around 5,000 duplicates (5-6%), and about 10 of every 150 sentences were incomplete. It seems they are split after abbreviations that end with a period.

Thanks for the feedback. We now have a few issues filed to deal with both duplicates and better tokenization:

Thank you, I’ll check them out.

For Kabyle I got an encoding error. I replaced the content of cp1252.py (Western Europe) with the content of utf_8.py (all languages). It works.

Can you please file an issue on GitHub so we can fix the encoding error? Thanks!

Hi,

Will this script automatically be applied to all languages when it is finalised? I’m currently working on extracting sentences from the Maltese text corpus to add them to the sentence collector, but this corpus already contains sentences from Wikipedia articles.

Should I include Wikipedia sentences from the corpus as long as they stick to the 3 random sentences per article rule, or should I avoid these for now and only use the rest of the corpus?

Due to legal constraints we will be asking only for rules files for each language; we will run the extraction based on those rules, to ensure with our systems that all rules are correctly applied.

Feel free to send a PR with the Maltese rules (and maybe a blacklist file) and a description covering:

  • How many sentences did you get?
  • How did you generate the blacklist?
  • A review from at least 2-3 native speakers estimating the error ratio on a few samples from your full output (the snippet below shows a quick way to count and sample).
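For the counting and the review samples, basic bash is enough; wiki.mt.txt is just a placeholder for whatever your output file is called:

```bash
# Total number of extracted sentences.
wc -l < wiki.mt.txt

# Random sample of 500 sentences for the native speakers to review.
shuf -n 500 wiki.mt.txt > review-sample.txt
```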

We haven’t automated anything yet; that’s part of the scoping work we want to start by September, based on the experiences from this technical feedback.

Thanks!

Hello,
Right now there are only a few thousand sentences available for Esperanto and we already have a lot of duplicates recorded. (See here why this is bad.)
I used the Common Voice Wiki Scraper to get more Public Domain sentences in Esperanto.

Since the Esperanto community is small and the 1.2 million sentence aim is not realistic for us, I decided to use very strict rules to get fewer sentences of higher quality. In my first run I got ~260 000 sentences, but after I applied some rules and added a blacklist it boiled down to ~128 000 sentences, and 96 000 without repetitions. This is what I did:

  • I excluded most letters that are not part of the Esperanto alphabet to filter out foreign words and phrases. I also excluded q, w, x and y, which are not part of the Esperanto alphabet, and this helped a lot to get rid of all sorts of names and words in other languages. Here is the rule file I created for that.
  • This file also includes some typical Esperanto abbreviations that I often found in the sentences, and some stuff I shamelessly stole from other languages, like the deletion of double spaces.
  • I also excluded unusual letter combinations that are only used in foreign words, like sch, the, sh, cc, … This helped a lot to avoid German, English and Italian words that are very common.
  • I created a blacklist of uncommon words, most of them not Esperanto. I chose to exclude words that are used less than 27 times. That’s much lower than most other languages have chosen, but this wiki is smaller, and the blacklist still contains more than one million words. EDIT: I later switched to a threshold of 80 repetitions.
  • I sorted everything alphabetically. This helped me to delete duplicates, and at the end of the list I found some Russian sentences that had somehow made it into the collection.
  • After that I shuffled the sentences randomly again so that the list feels natural (a rough bash sketch of these steps follows this list).
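Roughly, the frequency-based blacklist and the final clean-up steps boil down to something like this (the file names and the threshold are the ones from my runs; the word splitting is quick and dirty, so how the special Esperanto letters are handled depends on your locale settings):

```bash
# Count how often every word appears in the extracted sentences.
grep -oE '[[:alpha:]]+' wiki.eo.txt | sort | uniq -c | sort -rn > word-frequency.txt

# Words that appear fewer than 80 times go into the blacklist.
awk '$1 < 80 { print $2 }' word-frequency.txt > blacklist.txt

# Sort, drop exact duplicates, then shuffle so the order feels natural again.
sort -u wiki.eo.txt | shuf > no-duplicates.txt
```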

The result is this list of sentences (6.8 MB): https://raw.githubusercontent.com/stefangrotz/common-voice-eo-vikipedio/master/wiki.eo-80.txt
This is the list without duplicates: https://raw.githubusercontent.com/stefangrotz/common-voice-eo-vikipedio/master/no-dublicates-80.txt
Here are 300 randomized sentences from the latest extraction: https://github.com/stefangrotz/common-voice-eo-vikipedio/blob/master/random300-github-review.txt

The error rate is pretty low, but there are still many non-Esperanto words in the sentences that I would like to avoid.

Where do we go from here? Do I have to put them all manually into the Sentence Collector, or is there a better way?

Edit: updated files and numbers from my latest runs.

@stergro I see you have created a pull request; we’ll just need a few details about the output:

Hey @nukeador, thanks for the quick reaction. I closed my pull request and there will be another one in a few days.

I worked on the rules, added more abbreviations and, most importantly, made an analysis of the frequency of the letters in my last sentence list. This helped me a lot to collect a big number of letters that are now also excluded in the rule file. This slows everything down but helps a lot with the quality. There are still a few foreign words in some sentences, but at least they are all written in letters that exist in the Esperanto alphabet.
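The letter-frequency analysis itself was nothing fancy, roughly along these lines (the file name is from my setup):

```bash
# Count how often each character occurs; the rare ones at the bottom are
# good candidates for disallowed_symbols in the rules file.
grep -o . wiki.eo.txt | sort | uniq -c | sort -rn > letter-frequency.txt
tail -n 40 letter-frequency.txt
```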

Hey my fellow Esperantists @tirifto @Pablo_Busto @Mte90 @nicolaruggiero1986, would you like to help me estimate the error rate for these new sentences from the Wikipedia in Esperanto? I created a file with 300 random sentences out of the 96 000 sentences. I would guess that we have an error rate of around 5/100; what do you think? It is enough if you only look at the first 100 or 200 sentences, but it is important that you give me a number, because we can only get the file into Common Voice if we have an error rate.

Esperanto is simpler than other languages here because the alphabet has specific letters, so I don’t think that will be a problem with the extractor.
Compared to Italian or Spanish, where we need to exclude for example Greek or German letters, Esperanto is simpler because it rewrites foreign words using its own alphabet.

That’s not completely true. I excluded all letters that are not part of the Esperanto alphabet with the script, but only a fraction of the articles transcribe everything into the Esperanto alphabet. Since Wikipedia is an encyclopedia, there are still a lot of words in the texts that are not transcribed into Esperanto. One example:

Ĝi ŝuldas sian nomon al itala urbo Lecce, ĉefurbo de la provinco Lecce.

But the error rate is more about general errors like cut-off sentences, grammar errors, typos and so on.

Thanks to the feedback from the Catalan and Esperanto communities, I’ve updated the repo README to clarify the expectations on how to get rules into the repo and how to get the extracted sentences incorporated into CV.

I think that is acceptable as an error rate; starting to focus on wrong translations for Esperanto could be a pain and a very big task.
Also, grammar errors are something that is quite impossible to fix without working on Wikipedia itself, which was chosen precisely because it makes it easier to avoid such issues.
It would probably be simpler for Esperanto to use a dictionary and check whether all the words of a sentence are in there, but that would be very expensive for the tool to do.
As for cut-off sentences, that is again a problem of the scraper and not of the language, and it needs to be reported.

I will see if there are “disallowed” words for Italian or something like that.

I already created a blacklist with over a million disallowed words based on repetition. I chose words that appear less than 27 times in all texts; maybe this was too little. Other languages chose to avoid words that appear less than 80 times, but for Esperanto this would include many valid words. I will create another blacklist with more words this evening.

Okay, I just started a new run with a blacklist for words less frequent than 80. This was hard, because this list includes a lot of valid and interesting words, but also a lot of nonsense. Still, there are 28 000 words left to build sentences from, so I think this will work.

Since this thread is mostly about the script and technical questions, there are a few things I would like to know:

  • I understand that you don’t accept pull requests with sentences from this script and that you have to run the script yourself to avoid legal problems in case someone pulled too many sentences per article. But can I edit the result once the file is ready? I would only delete sentences and add nothing new. Some common errors are easier to delete by hand.
  • For Esperanto it turned out to be extremely useful to exclude almost all letters that are not part of the Esperanto alphabet with disallowed_symbols. But this was a lot of work and slows everything down a lot. It would be much more useful to have a whitelist of allowed characters. Is this possible? I bet this could also be useful for some other languages.
  • A lot of sentences are lost because of ignored abbreviations. It would be great if one could replace abbreviations with their fully written-out form (a rough sketch of the idea follows this list).
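What I have in mind for the abbreviations is essentially a small pre-processing step like this, run before the sentence splitter sees the periods (the abbreviations are just common Esperanto examples, and input.txt / expanded.txt are hypothetical file names):

```bash
# Hypothetical pre-processing: expand common Esperanto abbreviations so their
# periods no longer cut sentences in half.
sed -e 's/k\.t\.p\./kaj tiel plu/g' \
    -e 's/\bd-ro\b/doktoro/g' \
    -e 's/\bs-ro\b/sinjoro/g' \
    input.txt > expanded.txt
```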

Edit: Done, and I like the collection. To get rid of more foreign words I also excluded letter combinations that are almost never used in Esperanto, for example sch, sh, the, cc, … This filtered out a lot more words. I now get 128 000 sentences, and after deleting the duplicates, 96 000. This would give us enough sentences for the next two or three years if we keep working at the same speed.

I updated the linked files in the post above and changed some text to make things clearer.

Edit: Abbreviations instead of apprehensions 🤦

Yes, only deletion PRs on sentences are accepted, to remove wrong or bad sentences.

This might be something to consider if it brings more quality to some languages. Can you please open a GitHub issue so we can check the best approach with the devs? Thanks.

This would probably fall into the new features category. Can you please also open a GitHub issue about it so it doesn’t get lost? Thanks!