What is the Sentence Extractor
Common Voice is Mozilla’s initiative to help teach machines how real people speak. For this we need to collect sentences that people can read aloud on the website. Individual sentences can be submitted through the Sentence Collector. This only scales so far, so we also use the Sentence Extractor (formerly Wiki Scraper) to extract sentences from other sources.
Currently the only implemented source is Wikipedia. We are allowed to export a maximum of 3 sentences per article. Other sources can be integrated as well and can reuse the same rules files. See the bottom of this post for a short explanation.
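To illustrate the per-article limit, here is a minimal Python sketch. It is not the extractor's actual code (the real tool is the one in the cv-sentence-extractor repository); the function name `pick_sentences` and the `is_valid` callback are hypothetical and only stand in for "sentences that pass the rules file".

```python
import random

# Limit stated above: at most 3 sentences may be exported per article.
MAX_SENTENCES_PER_ARTICLE = 3


def pick_sentences(article_sentences, is_valid, rng=None):
    """Return at most MAX_SENTENCES_PER_ARTICLE valid sentences, chosen at random.

    `is_valid` is a hypothetical callback standing in for the rules check.
    """
    rng = rng or random.Random()
    candidates = [s for s in article_sentences if is_valid(s)]
    rng.shuffle(candidates)
    return candidates[:MAX_SENTENCES_PER_ARTICLE]
```

An article with fewer than 3 valid sentences simply contributes fewer sentences; nothing is padded.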
Skills needed to export a new language
- Comfortable with GitHub Pull Requests
- Comfortable with writing Regular Expressions or willingness to dive fairly deep into them (https://docs.rs/regex/1.3.9/regex/)
- Comfortable running Python and Rust scripts on your machine
See below for the specific flow.
The following languages were already exported:
Exported sentences can be found in their locale subfolder in the voice-web repository.
Currently we can’t re-run exports for these languages. One idea to look into is to re-run it for articles that were created after the export date.
Languages with open PRs
The following languages are currently being worked on, or have had some work done and need attention:
- A new rules file is created as a Pull Request at https://github.com/Common-Voice/cv-sentence-extractor. The configuration options are described in the README. You can also take the English rules file as a base and adjust it to your language (note that the English file does not use all possible rules options).
- The GitHub Action automatically extracts a sample which can be used to verify the rules file and find invalid sentences. More information can be found in its own Discourse topic.
- Improve the rules file based on the output above
- In many cases using a blocklist has drastically improved the sentence quality. See the README for an explanation on how to create a blocklist. This takes quite some time and internet bandwidth, as it currently needs to be run locally on your machine. I’ve filed https://github.com/Common-Voice/cv-sentence-extractor/issues/108 to improve that step.
- Once the blocklist and rules files are done and the sample output looks good, send the sample output to others for verification (see next chapter)
You can check https://github.com/Common-Voice/cv-sentence-extractor/pull/90 to see an example of a full workflow.
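As a rough illustration of what a blocklist does, here is a small Python sketch. It assumes the blocklist is a set of rarely occurring words and that any sentence containing one of them is dropped; the README remains the authoritative description of how to create one. The function names, the word-splitting regex, and the `min_occurrences` threshold are all hypothetical and not part of the extractor.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of word characters. Real languages may need more care.
WORD_RE = re.compile(r"\w+")


def build_blocklist(sentences, min_occurrences):
    """Collect words that occur fewer than `min_occurrences` times in `sentences`."""
    counts = Counter(w.lower() for s in sentences for w in WORD_RE.findall(s))
    return {word for word, count in counts.items() if count < min_occurrences}


def passes_blocklist(sentence, blocklist):
    """Return True if the sentence contains no blocklisted word."""
    return not any(w.lower() in blocklist for w in WORD_RE.findall(sentence))
```

Rare words are often names, foreign terms, or typos, which is one plausible reason a blocklist improves the sample output. The threshold is a trade-off: a higher value removes more noise but also more valid sentences.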
- How many sentences did you get at the end?
- How did you create the blocklist?
- Get at least 3 different native speakers (ideally linguists) to review a random sample of 100-500 sentences and estimate the average error ratio, then comment (or link their comments) in the PR.
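The review step can be sketched in Python as follows. This is only one way to do it, with hypothetical helper names; it assumes each reviewer reports the number of sentences they found faulty in the same sample, and that the average error ratio is the mean of the per-reviewer ratios.

```python
import random


def draw_review_sample(sentences, sample_size=300, seed=None):
    """Draw a random sample (without replacement) for reviewers to check."""
    rng = random.Random(seed)
    return rng.sample(sentences, min(sample_size, len(sentences)))


def average_error_ratio(errors_per_reviewer, sample_size):
    """Average the per-reviewer error ratios (errors found / sentences reviewed)."""
    ratios = [errors / sample_size for errors in errors_per_reviewer]
    return sum(ratios) / len(ratios)
```

For example, if three reviewers find 3, 6 and 9 faulty sentences in a sample of 300, the average error ratio is 2%. Using a fixed `seed` lets all reviewers work on the same sample.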
Once the Pull Request gets merged, we tag it with a special flag, which triggers the automatic export. Please note that any official export needs to be done this way: to make sure that the legal requirements are fulfilled, we do not allow self-exported files to be added to Common Voice. This means that any changes need to be made in the rules and blocklist files; there can’t be any manual cleanup of the resulting file. If cleanup is needed, you can do it once the sentences are merged into the Common Voice repository.
When the automatic full extraction is done, we will add a Pull Request to the voice-web repo adding the new sentences.
Adding new data sources
Even though the script currently only supports Wikipedia, new data sources can be added. We need to make sure that these data sets have the right license before adding them though. It’s best to create a Discourse post here to discuss it before working on it. Once it’s clear it can be used, refer to the README on how to technically add a new source, or tag me in the relevant Discourse topic and I’ll see if I can help out getting it integrated into the script.
If you have any questions or need help, please create a new topic here on Discourse or jump into our dedicated Matrix Channel. We’re happy to help out!