[Technical feedback needed] Wikipedia extractor script beta

Hello everyone,

As we announced a few months ago, as part of our effort to get sentences from large sources of data, we found a legal way to extract some Wikipedia sentences for our corpus under Public Domain, so we can have enough volume to accommodate the minimum of 2,000 hours of voice we want to collect to create a basic model for speech recognition.

Today I want to share the script we have been developing so that more technical people can take an early look, play with it, and help us improve it. The goal is to automate it so that, in the future, non-technical people can create per-language rules without any technical knowledge.


Important: Please note that we won’t be accepting any pull requests with sentences created using this script.


Skills you will need to test this script:

  • Basic git
  • Basics of installing Rust and running Python scripts
  • A decent machine (the scripts are CPU heavy)
  • Basic bash to check the generated files.

The main things we want technical contributors to test are:

  • Download the code.
  • Download the Wikipedia corpus for your language.
  • Create a rules file for your language.
  • Create a blacklist based on less common words (a rough sketch of this step follows this list).
  • Run the extraction using these rules and blacklist.
  • Report back here about your experience, what could be improved, and the results for your language (how many strings you got and what the overall quality looks like).
  • File GitHub issues for any bugs.
  • Share any ideas on how to automate the process so that non-technical people can generate rules and blacklists.
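
For the blacklist step, here is a minimal sketch of the idea, just to show what we mean by “less common words”. It is not the actual tooling: the input file name and the frequency threshold are made up, and it assumes a plain-text export with one sentence per line.

```python
# Rough sketch of building a blacklist of less common words.
# Assumptions (not the real script): input is a plain-text file with one
# sentence per line, and "less common" means fewer than MIN_OCCURRENCES hits.
from collections import Counter
import re

INPUT_FILE = "wiki.ka.txt"        # hypothetical extracted-sentences file
BLACKLIST_FILE = "blacklist.txt"  # one word per line
MIN_OCCURRENCES = 80              # illustrative threshold, tune per language

counts = Counter()
with open(INPUT_FILE, encoding="utf-8") as f:
    for line in f:
        # \w+ matches Unicode letters, so this also works for non-Latin scripts
        counts.update(word.lower() for word in re.findall(r"\w+", line))

rare_words = sorted(word for word, count in counts.items() if count < MIN_OCCURRENCES)

with open(BLACKLIST_FILE, "w", encoding="utf-8") as f:
    f.write("\n".join(rare_words) + "\n")

print(f"{len(rare_words)} words written to {BLACKLIST_FILE}")
```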

Again, we won’t be using the strings you generate directly; first we want to make sure the code works for other languages (it did for English, French, German and Spanish) and to see the rules and blacklist files you generated.

Thanks!


Hello,

I tried the script (using corrected rules and generated blacklist files) and got about 100,000 sentences from the Georgian Wikipedia (which has around 130,000 articles). However, there are duplicated and incomplete ones as well.

This is great, thanks for testing.

  • How many duplicates did you find? (We are working on integrating a duplicates filter; a rough sketch of the idea follows this list.)
  • Did you have the chance to check with other Georgian native speakers a couple of small samples (100-500 sentences) to evaluate the error rate?
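
To give an idea of what such a filter could look like, here is a minimal dedup sketch. It is only an illustration, not the filter we are integrating: it keeps the first occurrence of each sentence after trimming whitespace and folding case, and the file names are hypothetical.

```python
# Illustrative duplicates filter: keep the first occurrence of each sentence
# after trimming whitespace and folding case. Not the actual implementation.
def deduplicate(path_in: str, path_out: str) -> int:
    seen = set()
    kept = 0
    with open(path_in, encoding="utf-8") as src, \
         open(path_out, "w", encoding="utf-8") as dst:
        for line in src:
            sentence = line.strip()
            key = sentence.casefold()
            if sentence and key not in seen:
                seen.add(key)
                dst.write(sentence + "\n")
                kept += 1
    return kept

if __name__ == "__main__":
    print(deduplicate("sentences.txt", "sentences.dedup.txt"), "unique sentences kept")
```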

Thanks!

There were around 5,000 duplicates (5-6%), and about 10 of every 150 sentences were incomplete. It seems they are split after abbreviations with periods.
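
To illustrate what I mean with a generic example (not taken from the Georgian output): a splitter that cuts on every period treats the period of an abbreviation as a sentence boundary, while an abbreviation-aware one does not.

```python
# Generic illustration of the abbreviation problem, not the extractor's tokenizer.
import re

TEXT = "Prof. Smith founded the institute in 1921. It moved twice."
ABBREVIATIONS = {"prof.", "dr.", "e.g.", "i.e."}  # would need a per-language list

# Naive: split after every period, so "Prof." becomes its own truncated sentence.
naive = re.split(r"(?<=\.)\s+", TEXT)

# Abbreviation-aware: only split when the word before the period is not a known abbreviation.
def split_sentences(text):
    sentences, start = [], 0
    for match in re.finditer(r"\.\s+", text):
        last_word = text[start:match.start() + 1].split()[-1].lower()
        if last_word not in ABBREVIATIONS:
            sentences.append(text[start:match.start() + 1])
            start = match.end()
    sentences.append(text[start:])
    return sentences

print(naive)                  # ['Prof.', 'Smith founded the institute in 1921.', 'It moved twice.']
print(split_sentences(TEXT))  # ['Prof. Smith founded the institute in 1921.', 'It moved twice.']
```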

Thanks for the feedback. We now have a few issues filed to deal with both duplicates and better tokenization:

Thank you, I’ll check them out.

For Kabyle I got an encoding error. I replaced the content of cp1252.py (Western Europe) with the content of utf_8.py (all languages), and it works.
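
In case it helps others debug similar errors, here is a quick sketch to check whether a file is actually valid UTF-8 before changing anything. The file name is only an example; point it at the dump or at the generated output.

```python
# Report the first byte offset where a file stops being valid UTF-8, if any.
def first_invalid_utf8(path):
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None

pos = first_invalid_utf8("kabwiki-latest-pages-articles.xml")  # example file name
print("valid UTF-8" if pos is None else f"first invalid byte at offset {pos}")
```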
