As we announced a few months ago, as part of our effort to get sentences from large sources of data, we found a legal way to extract some Wikipedia sentences for our corpus under Public Domain. This should give us enough volume to support the minimum of 2,000 hours of voice we want to collect to create a basic model for speech recognition.
Today I want to share the script we have been developing, so that more technical people can take an early look, play with it, and help us improve it. The goal is to automate it so that, in the future, non-technical people can create per-language rules without any technical knowledge.
Important: Please note that we won’t be accepting any pull requests with sentences created using this script.
Don’t use the sentence collector to send Wikipedia sentences; we must follow the process described here for legal reasons.
Skills you will need to test this script:
- Basic git
- Basics of installing Rust and running Python scripts
- A decent machine (the scripts are CPU heavy)
- Basic bash to check the generated files.
The main things we want technical contributors to test are:
- Download the code.
- Download the Wikipedia corpus for your language.
- Create a rules file for your language.
- Create a blacklist based on less common words.
- Run the extraction using these rules and blacklist.
- Report back here on your experience, what to improve, and the results for your language (how many strings you got and what the overall quality is).
- File GitHub issues for any bugs.
- Share any ideas on how to automate the process so that non-technical people can generate rules and blacklists.
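For anyone wondering what "a blacklist based on less common words" could look like in practice, here is a minimal Python sketch of the idea. It is not the project's actual tooling: the function names, the word-matching regex and the frequency threshold are all assumptions made for illustration, and the real script may expect a different blacklist format.

```python
import re
from collections import Counter

# Matches runs of Unicode letters (no digits or underscores).
# This tokenization is an assumption; languages with apostrophes or
# hyphens inside words would need a different pattern.
WORD_RE = re.compile(r"[^\W\d_]+")


def count_words(sentences):
    """Count how often each lowercased word appears across all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(word.lower() for word in WORD_RE.findall(sentence))
    return counts


def build_blacklist(counts, min_occurrences):
    """Return, sorted, every word seen fewer than min_occurrences times."""
    return sorted(word for word, n in counts.items() if n < min_occurrences)


def filter_sentences(sentences, blacklist):
    """Keep only sentences that contain no blacklisted word."""
    banned = set(blacklist)
    return [
        sentence
        for sentence in sentences
        if not any(word.lower() in banned for word in WORD_RE.findall(sentence))
    ]


# Tiny made-up sample; a real run would read the extracted corpus instead.
sample = [
    "The cat sat on the mat.",
    "The cat likes the mat.",
    "A cat sat on a zyzzyva.",
]
counts = count_words(sample)
blacklist = build_blacklist(counts, min_occurrences=2)  # hypothetical threshold
kept = filter_sentences(sample, blacklist)

print(blacklist)
print(kept)
```

On a real corpus the threshold would need tuning per language, and the resulting word list is worth reviewing by hand, since rare words are not always bad ones.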
Again, we won’t be using the strings you generate directly. We first want to make sure the code works for other languages (it did for English, French, German and Spanish) and to see the rules and blacklist files you generate.
Update: We have a chat room on Matrix just for this topic. Feel free to join.