[Technical feedback needed] Wikipedia extractor script beta

Thank you, I’ll check them out.

For Kabyle I got an encoding error. I replaced the content of cp1252.py (Western Europe) with the content of utf_8.py (all languages). It works.

Can you please file an issue on GitHub so we can fix the encoding error? Thanks!

Hi,

Will this script automatically be applied to all languages when it is finalised? I’m currently working on extracting sentences from the Maltese text corpus to add them to the sentence collector, but this corpus already contains sentences from Wikipedia articles.

Should I include Wikipedia sentences from the corpus as long as they stick to the 3 random sentences per article rule, or should I avoid these for now and only use the rest of the corpus?

Due to legal constraints we will be asking only for rules files for each language; we will run the extraction ourselves, based on those rules, to ensure with our systems that all rules are correctly applied.

Feel free to send a PR with the Maltese rules (and maybe a blacklist file) and a description covering:

  • How many sentences did you get?
  • How did you generate the blacklist?
  • A review from at least 2-3 native speakers estimating the error ratio from a few samples from your full output.

We haven’t automated anything yet; that’s part of the scoping work we want to start by September, based on the experiences from this technical feedback.

Thanks!

Hello,
Right now there are only a few thousand sentences available for Esperanto and we already have a lot of duplicates recorded. (See here why this is bad.)
I used the Common Voice Wiki Scraper to get more public domain sentences in Esperanto.

Since the Esperanto community is small and the 1.2 million hour aim is not realistic for us, I decided to use very strict rules to get fewer sentences of higher quality. In my first run I got ~260 000 sentences, but after I applied some rules and added a blacklist it boiled down to ~128 000 sentences, and 96 000 without repetitions. This is what I did:

  • I excluded most letters that are not part of the Esperanto alphabet to filter out foreign words and phrases. Dropping q, w, x and y in particular helped a lot to get rid of all sorts of names and words in other languages. Here is the rule file I created for that
  • This file also includes some typical Esperanto abbreviations that I often found in the sentences, and some stuff I shamelessly stole from other languages, like the deletion of double spaces.
  • I also excluded unusual letter combinations that are only used in foreign words, like sch, the, sh, cc, … This helped a lot to avoid German, English and Italian words, which are very common.
  • I created a blacklist of uncommon words, most of which are not Esperanto. I chose to exclude words that are used fewer than 27 times. That is a much lower threshold than most other languages have chosen, but this wiki is smaller and the blacklist still contains more than one million words. EDIT: I later switched to a threshold of 80 repetitions.
  • I sorted everything alphabetically. This helped me delete duplicates, and at the end of the list I found some Russian sentences that had somehow made it into the collection.
  • After that I shuffled the sentences into random order again so that the list feels natural (see the sketch below).
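
For reference, here is a minimal Python sketch of that deduplicate-and-reshuffle step (file names are just examples):

    import random

    # Read the extractor output, one sentence per line.
    with open("wiki.eo.txt", encoding="utf-8") as f:
        sentences = f.read().splitlines()

    # A set removes exact duplicates; sorting groups similar lines together,
    # which is how stray foreign-script sentences end up clustered at the end.
    unique = sorted(set(sentences))

    # Shuffle so the final list no longer reads alphabetically.
    random.shuffle(unique)

    with open("no-duplicates.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(unique) + "\n")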

The result is this list of sentences (6.8 MB): https://raw.githubusercontent.com/stefangrotz/common-voice-eo-vikipedio/master/wiki.eo-80.txt
This is the list without duplicates: https://raw.githubusercontent.com/stefangrotz/common-voice-eo-vikipedio/master/no-dublicates-80.txt
Here are 300 randomized sentences from the latest extraction: https://github.com/stefangrotz/common-voice-eo-vikipedio/blob/master/random300-github-review.txt

The error rate is pretty low, but there are still many non-Esperanto words in the sentences that I would like to avoid.

Where do we go from here? Do I have to put them all into the sentence collector manually, or is there a better way?

Edit: updated files and numbers from my latest runs.

@stergro I see you have created a pull request, we’ll just need a few details about the output:

Hey @nukeador thanks for the quick reaction. I closed my pull request and there will be another one in a few days.

I worked on the rules, added more abbreviations, and most importantly I made an analysis of the letter frequencies in my last sentence list. This helped me collect a large number of letters that are now also excluded in the rule file. This slows everything down but helps a lot with quality. There are still a few foreign words in some sentences, but at least they are all written in letters that exist in the Esperanto alphabet.
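
For the curious, a letter-frequency analysis like that takes only a few lines of Python (the file name is an example); letters near the bottom of the output are usually foreign characters worth excluding:

    from collections import Counter

    with open("wiki.eo.txt", encoding="utf-8") as f:
        text = f.read()

    # Count every alphabetic character; rarely used letters are mostly
    # foreign and are candidates for exclusion in the rule file.
    letters = Counter(ch for ch in text if ch.isalpha())
    for char, count in letters.most_common():
        print(char, count)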

Hey my fellow Esperantists @tirifto @Pablo_Busto @Mte90 @nicolaruggiero1986, would you like to help me estimate the error rate for these new sentences from the Wikipedia in Esperanto? I created a file with 300 random sentences out of the 96 000. I would guess that we have an error rate of around 5/100; what do you think? It is enough if you only look at the first 100 or 200 sentences, but it is important that you give me a number, because we can only get the file into Common Voice if we have an error rate.

Esperanto is simpler here than other languages because the alphabet has specific letters, so I don’t think that will be a problem with the extractor.
Compared to Italian or Spanish, where we need to exclude for example Greek or German letters, Esperanto is simpler because foreign words are rewritten using its own alphabet.

That’s not completely true. I excluded all letters that are not part of the Esperanto alphabet with the script, but only a fraction of the articles transcribe everything into the Esperanto alphabet. Since Wikipedia is an encyclopedia, there are still a lot of words in the texts that are not transcribed into Esperanto. One example:

Ĝi ŝuldas sian nomon al itala urbo Lecce, ĉefurbo de la provinco Lecce. [It owes its name to the Italian city of Lecce, capital of the province of Lecce.]

But the error rate is more about general errors like truncated sentences, grammar errors, typos and so on.

Thanks to the Catalan and Esperanto communities’ feedback, I’ve updated the repo README to clarify the expectations on how to get rules into the repo and how to get sentences extracted and incorporated into CV.

I think that is an acceptable error rate; starting to focus on wrong translations for Esperanto could be a pain and a very big task.
As for grammar errors, those are practically impossible to fix without working on Wikipedia itself, which was chosen precisely because it makes issues easier to avoid.
It would probably be simpler for Esperanto to build a dictionary and check whether all the words of a sentence are in it, but that would be very expensive for the tool (see the sketch below).
Truncated sentences, again, are a problem of the scraper rather than of the language, and need to be reported.
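
A minimal Python sketch of that dictionary idea, assuming a plain word list with one word per line (the file name is hypothetical):

    import re

    # Load a hypothetical Esperanto word list, one word per line.
    with open("esperanto-words.txt", encoding="utf-8") as f:
        dictionary = set(f.read().split())

    def all_words_known(sentence):
        """Keep a sentence only if every word is in the word list."""
        words = re.findall(r"\w+", sentence.lower())
        return all(word in dictionary for word in words)

Set lookups are cheap per word, so the real cost would be in building and maintaining a good word list.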

I will see if there are “disallowed” words for italian or something like that.

I already created a blacklist with over a million disallowed words based on repetition. I chose words that appear fewer than 27 times across all texts; maybe that threshold was too low. Other languages chose to avoid words that appear fewer than 80 times, but this would put many valid words on the blacklist. I will create another blacklist with more words this evening.
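
For reference, a frequency-based blacklist like this can be built with a short Python script (the threshold and file names are examples):

    import re
    from collections import Counter

    # Count how often each word occurs in the full extraction.
    counts = Counter()
    with open("wiki.eo.txt", encoding="utf-8") as f:
        for line in f:
            counts.update(re.findall(r"\w+", line.lower()))

    # Every word below the threshold goes on the blacklist.
    THRESHOLD = 80
    with open("blacklist.txt", "w", encoding="utf-8") as f:
        for word, count in counts.items():
            if count < THRESHOLD:
                f.write(word + "\n")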

Okay, I just started a new run with a blacklist for words less frequent than 80. This was hard, because the list includes a lot of valid and interesting words, but also a lot of nonsense. Still, there are 28 000 words left to build sentences from, so I think this will work.

Since this thread is mostly about the script and technical questions, there are two things I would like to know:

  • I understand that you don’t accept pull requests with sentences from this script and that you have to run the script yourself to avoid legal problems in case someone pulled too many sentences per article. But can I edit the result once the file is ready? I would only delete sentences and add nothing new. Some common errors are easier to delete by hand.
  • For Esperanto it turned out to be extremely useful to exclude almost all letters that are not part of the Esperanto alphabet with disallowed_symbols. But this was a lot of work and slows everything down a lot. It would be much more useful to have a whitelist of allowed characters (see the sketch after this list). Is this possible? I bet this could also be useful for some other languages.
  • A lot of sentences are lost because of ignored abbreviations. It would be great if one could replace abbreviations with their fully written-out form.
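
To illustrate the whitelist idea from the second point, here is a rough Python sketch of what such a filter could do; the Esperanto letters are real, but the allowed punctuation set is just an assumption:

    import re

    # Esperanto letters (no q, w, x, y) plus digits and basic punctuation.
    ALLOWED = re.compile(r"^[a-pr-vzĉĝĥĵŝŭA-PR-VZĈĜĤĴŜŬ0-9 .,;:!?'\"()-]+$")

    def is_clean(sentence):
        """True if the sentence contains only whitelisted characters."""
        return bool(ALLOWED.match(sentence))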

Edit: Done, and I like the collection. To get rid of more foreign words I also excluded letter combinations that are almost never used in Esperanto, for example sch, sh, the, cc, … This filtered out a lot more words. I now get 128 000 sentences, and 96 000 after deleting the duplicates. This would give us enough sentences for the next two or three years if we keep working at the same speed.

I updated the linked files in the post above and changed some text to make things clearer.

Edit: Abbreviations instead of apprehensions 🤦

Yes, only deletion PRs on sentences are accepted, to remove wrong or bad sentences.

This might be something to consider if it brings more quality to some languages. Can you please open a GitHub issue so we can check the best approach with the devs? Thanks.

This would probably fall into the new features category. Can you please also open a GitHub issue about it so it doesn’t get lost? Thanks.

Okay, I opened an issue about the whitelist here: #50. There is already an old issue about converting abbreviations: #9

  • I couldn’t install “cargo” so another Basque contributor made the extraction for me.
  • Basque Wikipedia’s result is available here (25MB): https://ikusimakusi.eus/bitartekoak/wiki.eu.txt
  • It contains about 399,474 rows, 200,362 of them repeated. So the unique sentences in the file (after a sort -u) are 49.85%.
  • I checked the first 100 sentences and 90 are right. 10 are wrongly cut, apparently for the same reason: they all start with “mendean” [in the century], probably because the scraper cuts sentences like “XX. mendean…” [in the 20th century…] into “XX.” and “mendean…”.
    • Basque, like other languages, uses Roman numerals to express centuries. But surely the problem isn’t the numerals themselves; rather, Basque uses the dot character both to end sentences and to express ordinals: “1. atera” [To the 1st gate], “2. atean” [On the 2nd gate], “XIII. mendea” [13th century], etc.
      • As we don’t want acronyms or digits in Common Voice’s sentences, I think this cutting problem (at least for Basque) could be fixed by skipping digits ([0-9]) and consecutive upper-case letters ([A-Z][A-Z]). This wouldn’t fix Roman ordinals with just one character, like “I”, “V” and “X”, but it would help. It’s just an idea… (see the sketch below)
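
For illustration, that skipping idea could look like this as a Python post-filter (the real fix would presumably live in the scraper’s rules):

    import re

    # Skip candidates containing digits or two consecutive capital letters,
    # which catches most ordinals such as "XX. mendean".
    SKIP = re.compile(r"[0-9]|[A-Z]{2}")

    def keep(sentence):
        return SKIP.search(sentence) is None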

These are great advances @txopi :smiley:

You might be able to play with the rules files and the blacklist to avoid Roman ordinals. Other people in this topic would be able to help with the regex.

Once you have a set of rules and blacklist that produce an output that is rated as <7% error rate by 2-3 native speakers, feel free to open a PR adding the following information:

  • How many sentences are you getting?
  • How did you create the blacklist? (Specify the criteria, e.g. words with <80 repetitions.)
  • Get 2-3 additional native speakers (ideally some linguists) to comment here with the estimated error rate. You can share with them a few samples of 500 random sentences from your output.

Cheers.

Hi, I’m trying to collect sentences for Russian, and at the step:
cargo run -- extract -l russian -d ../wikiextractor/text/ >> wiki.ru.txt
there is an error:
Compiling punkt v1.0.5
error[E0554]: #![feature] may not be used on the stable release channel
--> /home/user/.cargo/registry/src/github.com-1ecc6299db9ec823/punkt-1.0.5/src/lib.rs:141:1
|
141 | #![feature(proc_macro_hygiene)]
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error: aborting due to previous error

For more information about this error, try rustc --explain E0554.
error: Could not compile punkt.

To learn more, run the command again with --verbose.

Could you help with this?
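
For reference: error E0554 means the punkt crate uses a compiler feature that is only available on nightly Rust, so the build fails on the stable channel. Assuming rustup is installed, installing a nightly toolchain (rustup toolchain install nightly) and running the same command with cargo +nightly run should get past this.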

@txopi awesome!! Wiki links! I will check them out.