Sentence Extractor - Current Status and Workflow Summary

What is the Sentence Extractor

Common Voice is Mozilla’s initiative to help teach machines how real people speak. For this we need to collect sentences that people can read aloud on the website. Individual sentences can be submitted through the Sentence Collector. This only scales so far, so we also have the Sentence Extractor (formerly Wiki Scraper) to extract sentences from other sources.

Currently the only implemented source is Wikipedia. We are allowed to export a maximum of 3 sentences per article. Other sources can be integrated as well, and the same rules files can be used for them. See the section "Adding new data sources" below for a short explanation.
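To make the 3-sentences-per-article limit concrete, here is a minimal sketch in Python. It is not the extractor's actual code (the extractor is written in Rust). The function name `pick_sentences`, the `is_valid` callback standing in for the rules file, and the random selection are all assumptions made for illustration.

```python
import random

MAX_PER_ARTICLE = 3  # the export limit mentioned above


def pick_sentences(sentences, is_valid, rng=None, limit=MAX_PER_ARTICLE):
    """Return at most `limit` sentences from one article that pass `is_valid`.

    `is_valid` is a hypothetical stand-in for the checks in a rules file.
    Picking at random is an assumption, not confirmed behaviour of the tool.
    """
    rng = rng or random.Random()
    candidates = [s for s in sentences if is_valid(s)]
    rng.shuffle(candidates)
    return candidates[:limit]


# Usage: one article with five sentences, one of which is too short.
article = ["One two three.", "Short", "Four five six.", "Seven eight nine.", "Ten eleven twelve."]
picked = pick_sentences(article, lambda s: len(s.split()) >= 3, random.Random(0))
```

However many valid sentences an article contains, no more than three of them end up in the export.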

Skills needed to export a new language

  • Comfortable with GitHub Pull Requests
  • Comfortable with writing Regular Expressions or willingness to dive fairly deep into it (https://docs.rs/regex/1.3.9/regex/)
  • Comfortable running Python and Rust scripts on your machine

See below for the specific flow.

Languages exported

The following languages have already been exported:

  • ca
  • cs
  • de
  • en
  • eo
  • es
  • fr
  • it
  • ka
  • zh-CN

Exported sentences can be found in their locale subfolder in the voice-web repository.

Currently we can’t re-run exports for these languages. One idea to look into is re-running the export only for articles that were created after the export date.

Languages with open PRs

The following languages are currently being worked on, or have had some work done and need attention:

General flow

  • Create a new rules file as a Pull Request at https://github.com/Common-Voice/cv-sentence-extractor. The configuration options are described in the README. You can also take the English rules file as a base and adjust it to your language (note that the English file does not use all possible rules options).
  • The GitHub Action automatically extracts a sample which can be used to verify the rules file and find invalid sentences. More information can be found in its own Discourse topic.
  • Improve the rules file based on the sample output.
  • In many cases using a blocklist has drastically improved the sentence quality. See the README for an explanation of how to create a blocklist. This takes quite some time and internet bandwidth, as it currently needs to be run locally on your machine. I’ve filed https://github.com/Common-Voice/cv-sentence-extractor/issues/108 to improve that step.
  • Once the blocklist and rules files are done and the sample output looks good, send the sample output to others for verification (see the next section).

You can check https://github.com/Common-Voice/cv-sentence-extractor/pull/90 to see an example of a full workflow.

Approval criteria/questions

  • How many sentences did you get at the end?
  • How did you create the blocklist?
  • Get at least 3 different native speakers (ideally linguists) to review a random sample of 100-500 sentences, estimate the average error ratio, and comment (or link their comments) in the PR.
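The review step above can be sketched in Python as follows. This is only an illustration: the function names and the way reviewers record their verdicts (one boolean per sentence, `True` meaning "this sentence has a problem") are assumptions, not part of the extractor.

```python
import random


def draw_sample(sentences, size=200, seed=None):
    """Draw a random sample without replacement (100-500 is the suggested size)."""
    rng = random.Random(seed)
    return rng.sample(sentences, min(size, len(sentences)))


def error_ratio(flags):
    """Share of sentences one reviewer flagged as wrong."""
    if not flags:
        raise ValueError("a reviewer must have reviewed at least one sentence")
    return sum(flags) / len(flags)


def average_error_ratio(per_reviewer_flags):
    """Mean of the individual reviewers' error ratios."""
    ratios = [error_ratio(flags) for flags in per_reviewer_flags]
    return sum(ratios) / len(ratios)


# Usage with placeholder data: 1000 extracted sentences, three reviewers.
extracted = [f"Sentence number {i}." for i in range(1000)]
sample = draw_sample(extracted, size=200, seed=1)
reviews = [[True, False], [False, False], [True, True]]
average = average_error_ratio(reviews)
```

Using a fixed seed lets all reviewers look at the same sample, which makes their individual ratios easier to compare.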

Integration workflow

Once the Pull Request gets merged, we tag it with a special flag, which triggers the automatic export. Please note that any official export needs to be done this way. To make sure that the legal requirements are fulfilled, we do not allow self-exported files to be added to Common Voice. This means that any changes need to be made in the rules and blocklist files; there can’t be any manual cleanup of the resulting file. If cleanup is needed, you can do it once the sentences are merged into the Common Voice repository.

When the automatic full extraction is done, we will add a Pull Request to the voice-web repo adding the new sentences.

Adding new data sources

Even though the script currently only supports Wikipedia, new data sources can be added. We need to make sure that these data sets have the right license before adding them though. It’s best to create a Discourse post here to discuss it before working on it. Once it’s clear it can be used, refer to the README on how to technically add a new source, or tag me in the relevant Discourse topic and I’ll see if I can help out getting it integrated into the script.

Getting support

If you have any questions or need help, please create a new topic here on Discourse or jump into our dedicated Matrix Channel. We’re happy to help out!

I have added two manual triggers to the GitHub Actions.

Run Blocklist Generation

Until now you had to run a full extraction locally to generate the blocklist. Now this can be done through a comment on any issue. If you already have a PR open, use that. If not, create your own issue, so we don’t spam unrelated issues.

To trigger the job, add the following line in a new comment:

/action blocklist [language code] [max occurrences of words]

For example: /action blocklist en 80

The job will then post a link to the GitHub Action run where you will find the resulting files (artifacts).

You can see an example here: https://github.com/Common-Voice/cv-sentence-extractor/issues/108#issuecomment-653887129
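As a rough illustration of what a frequency-based blocklist does, here is a minimal Python sketch. It assumes that the threshold means "block every word that appears fewer than this many times in the extracted text". That reading, the function name `build_blocklist` and the simple tokenization are all assumptions; the README describes the actual procedure.

```python
import re
from collections import Counter


def build_blocklist(sentences, max_occurrences):
    """Return words that occur fewer than `max_occurrences` times.

    Assumption: rare words are more likely to be typos, foreign words or
    names, so they are candidates for the blocklist.
    """
    counts = Counter(
        word
        for sentence in sentences
        for word in re.findall(r"\w+", sentence.lower())
    )
    return sorted(word for word, n in counts.items() if n < max_occurrences)


# Usage with a tiny placeholder corpus and a threshold of 2.
corpus = ["The cat sat.", "The dog sat.", "The axolotl."]
blocklist = build_blocklist(corpus, 2)
```

With a real Wikipedia export the threshold is much higher, as in the "en 80" example above.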

Trigger a full extraction

In addition to the full extraction that runs when a PR gets merged, the following comment format triggers a full extraction as well. Usually the sample extraction that runs when you create a PR or push new commits to it should be enough. However, if you need more sentences to verify your rules than the sample extraction provides, you can use this method.

/action extract [language code]

Example: /action extract en
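If you want to check a comment before posting it, the two command formats above can be validated with a small Python helper. This is a hypothetical sketch: `parse_action` and the regular expression are my own illustration, and the GitHub Action may parse comments differently.

```python
import re

# Matches "/action blocklist <lang> <number>" or "/action extract <lang>".
# The language-code pattern (e.g. "en", "zh-CN") is an assumption.
ACTION_RE = re.compile(
    r"^/action\s+(blocklist|extract)\s+([A-Za-z]{2,3}(?:-[A-Za-z0-9]+)*)(?:\s+(\d+))?$"
)


def parse_action(comment):
    """Return a dict describing the first valid command in `comment`, or None."""
    for line in comment.splitlines():
        match = ACTION_RE.match(line.strip())
        if not match:
            continue
        command, language, threshold = match.groups()
        if command == "blocklist" and threshold is not None:
            return {
                "command": command,
                "language": language,
                "max_occurrences": int(threshold),
            }
        if command == "extract" and threshold is None:
            return {"command": command, "language": language}
    return None


# Usage with the two examples from this post.
blocklist_cmd = parse_action("/action blocklist en 80")
extract_cmd = parse_action("/action extract en")
```

A comment that matches neither format, such as a blocklist command without a threshold, yields `None`.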