I ran the wiki scraper script for Esperanto, what are the next steps?

Hello,
Right now there are only a few thousand sentences available for Esperanto, and we already have a lot of duplicate recordings. (See here for why this is bad.)
I used the Common Voice Wiki Scraper to get more public domain sentences in Esperanto.

Since the Esperanto community is small and the 1.2 million hour aim is not realistic for us, I decided to use very strict rules to get fewer but higher-quality sentences. In my first run I got ~260 000 sentences, but after I applied some rules and added a blacklist it boiled down to ~130 000 sentences. This is what I did:

  • I excluded most letters that are not part of the Esperanto alphabet, to filter out foreign words and phrases. In particular, excluding q, w, x and y (which do not exist in Esperanto) helped a lot to get rid of all sorts of names and words in other languages. Here is the rule file I created for that (a rough Python sketch of the filtering follows after this list).
  • This file also includes some typical Esperanto abbreviations that I often found in the sentences, plus some rules I shamelessly stole from other languages, like removing double spaces.
  • I created a blacklist of uncommon words, most of which are not Esperanto. I chose to exclude words that are used fewer than 27 times. That is a much lower threshold than most other languages chose, but this wiki is smaller, and the blacklist still contains more than one million words.
  • I also put two common words on the blacklist, distrikto and loĝantojn, because otherwise there would be thousands of sentences just about cities being located in districts and how many inhabitants they have.
  • I sorted everything alphabetically. This helped me to delete duplicates, and at the end of the list I found some Russian sentences that had somehow made it into the collection.
  • After that I shuffled the sentences into random order again so that the list feels natural (these last two steps are sketched in the second snippet below).
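
For anyone who wants to reproduce the post-processing, here is a rough Python sketch of the filtering steps (the character rule plus the word blacklist). It is only a sketch based on my description above: the file names, the exact allowed-character set and the helper functions are placeholders, not the actual scraper or the rule file linked above.

```python
# Rough sketch of the filtering, NOT the actual scraper rules.
# File names, the allowed-character set and thresholds are placeholders.
import re
from collections import Counter

# Esperanto letters plus digits and basic punctuation; q, w, x, y are left out.
ALLOWED = re.compile(r"^[a-pr-vzĉĝĥĵŝŭA-PR-VZĈĜĤĴŜŬ0-9 .,;:!?'\"()\-]+$")
MIN_WORD_COUNT = 27  # words used fewer than 27 times go on the blacklist

def build_blacklist(sentences):
    """Blacklist every rare word, plus the two overly common 'city statistics' words."""
    counts = Counter(w for s in sentences for w in re.findall(r"\w+", s.lower()))
    blacklist = {w for w, c in counts.items() if c < MIN_WORD_COUNT}
    blacklist |= {"distrikto", "loĝantojn"}
    return blacklist

def keep(sentence, blacklist):
    """Drop sentences with foreign letters or with any blacklisted word."""
    if not ALLOWED.match(sentence):
        return False
    return not any(w in blacklist for w in re.findall(r"\w+", sentence.lower()))

with open("wiki.eo.all.txt", encoding="utf-8") as f:  # raw scraper output (placeholder name)
    sentences = [line.strip() for line in f if line.strip()]

blacklist = build_blacklist(sentences)
filtered = [s for s in sentences if keep(s, blacklist)]

with open("wiki.eo.filtered.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(filtered) + "\n")
```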
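
The deduplication and re-shuffling at the end were essentially this (again just a sketch, with placeholder file names):

```python
import random

# Read the filtered sentences (placeholder file name).
with open("wiki.eo.filtered.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

# set() removes exact duplicates; sorting puts non-Latin lines
# (like the stray Russian sentences) at the end of the list,
# where they are easy to spot and delete by hand.
unique = sorted(set(sentences))

# Shuffle again so the final list does not read alphabetically.
random.shuffle(unique)

with open("wiki.eo.final.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(unique) + "\n")
```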

The result is this list of sentences (7.8 MB): https://raw.githubusercontent.com/stefangrotz/common-voice-wiki-scraper/master/wiki.eo.no-dublicates.non-alphabetical.txt
Here is an excerpt with the first 2 000 sentences: https://github.com/stefangrotz/common-voice-wiki-scraper/blob/master/first2000.txt

The error rate is pretty low, but there are still many non-Esperanto words in the sentences that I would like to avoid.

Where do we go from here? Do I have to add them all manually via the Sentence Collector, or is there a better way?
