[Technical feedback needed] Wikipedia extractor script beta

Thank you, I’ll check them out.

For Kabyle I got an encoding error. I replaced the content of cp1252.py (Western Europe) with the content of utf_8.py (all languages). It works.

Can you please file an issue on GitHub so we can fix the encoding error? Thanks!

Hi,

Will this script automatically be applied to all languages when it is finalised? I’m currently working on extracting sentences from the Maltese text corpus to add them to the sentence collector, but this corpus already contains sentences from Wikipedia articles.

Should I include Wikipedia sentences from the corpus as long as they stick to the 3 random sentences per article rule, or should I avoid these for now and only use the rest of the corpus?

Due to legal constraints we will be asking only for rules files for each language; we will run the extraction ourselves, based on those rules, to ensure with our systems that all rules are correctly applied.

Feel free to send a PR with the Maltese rules (and maybe a blacklist file) and a description covering:

  • How many sentences did you get?
  • How did you generate the blacklist?
  • A review from at least 2-3 native speakers estimating the error ratio from a few samples from your full output.

We haven’t automated anything yet; that’s part of the scoping work we want to start by September, based on the experiences from this technical feedback.

Thanks!

Hello,
Right now there are only a few thousand sentences available for Esperanto and we already have a lot of duplicates recorded. (See here why this is bad.)
I used the Common Voice Wiki Scraper to get more public domain sentences in Esperanto.

Since the Esperanto community is small and the 1.2 million hour aim is not realistic for us, I decided to use very strict rules to get fewer sentences of higher quality. In my first run I got ~260 000 sentences, but after I applied some rules and added a blacklist it boiled down to ~128 000 sentences, and 96 000 without repetitions. This is what I did:

  • I excluded most letters that are not part of the Esperanto alphabet to filter out foreign words and phrases. Dropping q, w, x and y in particular helped a lot to get rid of all sorts of names and words in other languages. Here is the rule file I created for that
  • This file also includes some typical Esperanto abbreviations that I often found in the sentences, and some stuff I shamelessly stole from other languages, like the deletion of double spaces.
  • I also excluded unusual letter combinations that are only used in foreign words, like sch, the, sh, cc, … This helped a lot to avoid German, English and Italian words, which are very common.
  • I created a blacklist of uncommon words, most of which are not Esperanto. I chose to exclude words that are used fewer than 27 times. That is a much lower threshold than most other languages have chosen, but this wiki is smaller and the blacklist still contains more than one million words. EDIT: I later switched to a threshold of 80 repetitions.
  • I sorted everything alphabetically. This helped me delete duplicates, and at the end of the list I found some Russian sentences that had somehow made it into the collection.
  • After that I shuffled the sentences into random order again so that the list feels natural (see the sketch below).
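
For reference, here is a minimal Python sketch of that deduplicate-and-reshuffle step (file names are just examples):

    import random

    # Read the extractor output, one sentence per line.
    with open("wiki.eo.txt", encoding="utf-8") as f:
        sentences = f.read().splitlines()

    # A set removes exact duplicates; sorting groups similar lines together,
    # which is how stray foreign-script sentences end up clustered at the end.
    unique = sorted(set(sentences))

    # Shuffle so the final list no longer reads alphabetically.
    random.shuffle(unique)

    with open("no-duplicates.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(unique) + "\n")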

The result is this list of sentences (6.8 MB): https://raw.githubusercontent.com/stefangrotz/common-voice-eo-vikipedio/master/wiki.eo-80.txt
This is the list without duplicates: https://raw.githubusercontent.com/stefangrotz/common-voice-eo-vikipedio/master/no-dublicates-80.txt
Here are 300 randomized sentences from the latest extraction: https://github.com/stefangrotz/common-voice-eo-vikipedio/blob/master/random300-github-review.txt

The error rate is pretty low, but there are still many non-Esperanto words in the sentences that I would like to avoid.

Where do we go from here? Do I have to put them all into the sentence collector manually, or is there a better way?

Edit: updated files and numbers from my latest runs.

@stergro I see you have created a pull request, we’ll just need a few details about the output:

Hey @nukeador thanks for the quick reaction. I closed my pull request and there will be another one in a few days.

I worked on the rules, added more abbreviations, and most importantly I made an analysis of the letter frequencies in my last sentence list. This helped me collect a large number of letters that are now also excluded in the rule file. This slows everything down but helps a lot with quality. There are still a few foreign words in some sentences, but at least they are all written in letters that exist in the Esperanto alphabet.
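
For the curious, a letter-frequency analysis like that takes only a few lines of Python (the file name is an example); letters near the bottom of the output are usually foreign characters worth excluding:

    from collections import Counter

    with open("wiki.eo.txt", encoding="utf-8") as f:
        text = f.read()

    # Count every alphabetic character; rarely used letters are mostly
    # foreign and are candidates for exclusion in the rule file.
    letters = Counter(ch for ch in text if ch.isalpha())
    for char, count in letters.most_common():
        print(char, count)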

Hey my fellow Esperantists @tirifto @Pablo_Busto @Mte90 @nicolaruggiero1986, would you like to help me estimate the error rate for these new sentences from the Wikipedia in Esperanto? I created a file with 300 random sentences out of the 96 000. I would guess that we have an error rate of around 5/100; what do you think? It is enough if you only look at the first 100 or 200 sentences, but it is important that you give me a number, because we can only get the file into Common Voice if we have an error rate.

Esperanto is simpler here than other languages because the alphabet has specific letters, so I don’t think that will be a problem with the extractor.
Compared to Italian or Spanish, where we need to exclude for example Greek or German letters, Esperanto is simpler because foreign words are rewritten using its own alphabet.

That’s not completely true. I excluded all letters that are not part of the Esperanto alphabet with the script, but only a fraction of the articles transcribe everything into the Esperanto alphabet. Since Wikipedia is an encyclopedia, there are still a lot of words in the texts that are not transcribed into Esperanto. One example:

Ĝi ŝuldas sian nomon al itala urbo Lecce, ĉefurbo de la provinco Lecce. [It owes its name to the Italian city of Lecce, capital of the province of Lecce.]

But the error rate is more about general errors like truncated sentences, grammar errors, typos and so on.

Thanks to the Catalan and Esperanto communities’ feedback, I’ve updated the repo README to clarify the expectations on how to get rules into the repo and how to get sentences extracted and incorporated into CV.

I think that is an acceptable error rate; starting to focus on wrong translations for Esperanto could be a pain and a very big task.
As for grammar errors, those are practically impossible to fix without working on Wikipedia itself, which was chosen precisely because it makes issues easier to avoid.
It would probably be simpler for Esperanto to build a dictionary and check whether all the words of a sentence are in it, but that would be very expensive for the tool (see the sketch below).
Truncated sentences, again, are a problem of the scraper rather than of the language, and need to be reported.
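
A minimal Python sketch of that dictionary idea, assuming a plain word list with one word per line (the file name is hypothetical):

    import re

    # Load a hypothetical Esperanto word list, one word per line.
    with open("esperanto-words.txt", encoding="utf-8") as f:
        dictionary = set(f.read().split())

    def all_words_known(sentence):
        """Keep a sentence only if every word is in the word list."""
        words = re.findall(r"\w+", sentence.lower())
        return all(word in dictionary for word in words)

Set lookups are cheap per word, so the real cost would be in building and maintaining a good word list.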

I will see if there are “disallowed” words for italian or something like that.

I already created a blacklist with over a million disallowed words based on repetition. I chose words that appear fewer than 27 times across all texts; maybe that threshold was too low. Other languages chose to avoid words that appear fewer than 80 times, but this would put many valid words on the blacklist. I will create another blacklist with more words this evening.
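
For reference, a frequency-based blacklist like this can be built with a short Python script (the threshold and file names are examples):

    import re
    from collections import Counter

    # Count how often each word occurs in the full extraction.
    counts = Counter()
    with open("wiki.eo.txt", encoding="utf-8") as f:
        for line in f:
            counts.update(re.findall(r"\w+", line.lower()))

    # Every word below the threshold goes on the blacklist.
    THRESHOLD = 80
    with open("blacklist.txt", "w", encoding="utf-8") as f:
        for word, count in counts.items():
            if count < THRESHOLD:
                f.write(word + "\n")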

Okay, I just started a new run with a blacklist for words less frequent than 80. This was hard, because the list includes a lot of valid and interesting words, but also a lot of nonsense. Still, there are 28 000 words left to build sentences from, so I think this will work.

Since this thread is mostly about the script and technical questions, there are two things I would like to know:

  • I understand that you don’t accept pull requests with sentences from this script and that you have to run the script yourself to avoid legal problems in case someone pulled too many sentences per article. But can I edit the result once the file is ready? I would only delete sentences and add nothing new. Some common errors are easier to delete by hand.
  • For Esperanto it turned out to be extremely useful to exclude almost all letters that are not part of the Esperanto alphabet with disallowed_symbols. But this was a lot of work and slows everything down a lot. It would be much more useful to have a whitelist of allowed characters (see the sketch after this list). Is this possible? I bet this could also be useful for some other languages.
  • A lot of sentences are lost because of ignored abbreviations. It would be great if one could replace abbreviations with their fully written-out form.
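
To illustrate the whitelist idea from the second point, here is a rough Python sketch of what such a filter could do; the Esperanto letters are real, but the allowed punctuation set is just an assumption:

    import re

    # Esperanto letters (no q, w, x, y) plus digits and basic punctuation.
    ALLOWED = re.compile(r"^[a-pr-vzĉĝĥĵŝŭA-PR-VZĈĜĤĴŜŬ0-9 .,;:!?'\"()-]+$")

    def is_clean(sentence):
        """True if the sentence contains only whitelisted characters."""
        return bool(ALLOWED.match(sentence))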

Edit: Done, and I like the collection. To get rid of more foreign words I also excluded letter combinations that are almost never used in Esperanto, for example sch, sh, the, cc, … This filtered out a lot more words. I now get 128 000 sentences, and 96 000 after deleting the duplicates. This would give us enough sentences for the next two or three years if we keep working at the same speed.

I updated the linked files in the post above and changed some text to make things clearer.

Edit: Abbreviations instead of apprehensions 🤦

Yes, only deletion PRs on sentences are accepted, to remove wrong or bad sentences.

This might be something to consider if it brings more quality to some languages. Can you please open a GitHub issue so we can check the best approach with the devs? Thanks.

This would probably fall into the new features category. Can you please also open a GitHub issue about it so it doesn’t get lost? Thanks.

Okay, I opened an issue about the whitelist here: #50. There is already an old issue about converting abbreviations: #9

  • I couldn’t install “cargo” so another Basque contributor made the extraction for me.
  • Basque Wikipedia’s result is available here (25MB): https://ikusimakusi.eus/bitartekoak/wiki.eu.txt
  • It contains about 399,474 rows, 200,362 of them repeated. So the unique sentences in the file (after a sort -u) are 49.85%.
  • I checked the first 100 sentences and 90 are right. 10 are wrongly cut, apparently for the same reason: they all start with “mendean” [in the century], probably because the scraper cuts sentences like “XX. mendean…” [in the 20th century…] into “XX.” and “mendean…”.
    • Basque, like other languages, uses Roman numerals to express centuries. But surely the problem isn’t the numerals themselves; rather, Basque uses the dot character both to end sentences and to express ordinals: “1. atera” [To the 1st gate], “2. atean” [On the 2nd gate], “XIII. mendea” [13th century], etc.
      • As we don’t want acronyms or digits in Common Voice’s sentences, I think this cutting problem (at least for Basque) could be fixed by skipping digits ([0-9]) and consecutive upper-case letters ([A-Z][A-Z]). This wouldn’t fix Roman ordinals with just one character, like “I”, “V” and “X”, but it would help. It’s just an idea… (see the sketch below)
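
For illustration, that skipping idea could look like this as a Python post-filter (the real fix would presumably live in the scraper’s rules):

    import re

    # Skip candidates containing digits or two consecutive capital letters,
    # which catches most ordinals such as "XX. mendean".
    SKIP = re.compile(r"[0-9]|[A-Z]{2}")

    def keep(sentence):
        return SKIP.search(sentence) is None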

These are great advances @txopi :smiley:

You might be able to play with the rules files and the blacklist to avoid Roman ordinals. Other people in this topic would be able to help with the regex.

Once you have a set of rules and blacklist that produce an output that is rated as <7% error rate by 2-3 native speakers, feel free to open a PR adding the following information:

  • How many sentences are you getting?
  • How did you create the blacklist? (Specify the criteria, e.g. words with <80 repetitions.)
  • Get 2-3 additional native speakers (ideally some linguists) to comment here with the estimated error rate. You can share with them a few samples of 500 random sentences from your output.

Cheers.

Hi, I’m trying to collect sentences for Russian, and at the step:
cargo run -- extract -l russian -d ../wikiextractor/text/ >> wiki.ru.txt
there is an error:
Compiling punkt v1.0.5
error[E0554]: #![feature] may not be used on the stable release channel
--> /home/user/.cargo/registry/src/github.com-1ecc6299db9ec823/punkt-1.0.5/src/lib.rs:141:1
|
141 | #![feature(proc_macro_hygiene)]
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error: aborting due to previous error

For more information about this error, try rustc --explain E0554.
error: Could not compile punkt.

To learn more, run the command again with --verbose.

Could you help with this?
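
For reference: error E0554 means the punkt crate uses a compiler feature that is only available on nightly Rust, so the build fails on the stable channel. Assuming rustup is installed, installing a nightly toolchain (rustup toolchain install nightly) and running the same command with cargo +nightly run should get past this.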

@txopi awesome!! Wiki links! I will check them out.