Sentence Extractor - Current Status and Workflow Summary

What is the Sentence Extractor

Common Voice is Mozilla’s initiative to help teach machines how real people speak. For this we need to collect sentences that people can read aloud on the website. Individual sentences can be submitted through the Sentence Collector. This can only scale so far, so we also use the Sentence Extractor (formerly Wiki Scraper) to extract sentences from other sources.

Currently the only implemented source is Wikipedia. We are allowed to export a maximum of 3 sentences per article. Other sources can be integrated as well, reusing the same rule files. See the bottom of this post for a short explanation.
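To make the per-article cap concrete, here is a small Python sketch. The random selection shown is an assumption for illustration only; the actual Rust implementation may pick sentences differently.

```python
import random

def pick_sentences(article_sentences, max_per_article=3, seed=None):
    """Pick at most `max_per_article` sentences from a single article.

    Mirrors the "maximum of 3 sentences per article" rule; the random
    sampling here is an illustrative assumption, not the tool's exact logic.
    """
    rng = random.Random(seed)
    if len(article_sentences) <= max_per_article:
        return list(article_sentences)
    return rng.sample(article_sentences, max_per_article)

sentences = ["First sentence.", "Second one.", "Third.", "Fourth.", "Fifth."]
print(len(pick_sentences(sentences, seed=42)))  # 3
```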

Skills needed to export a new language

  • Comfortable with GitHub Pull Requests
  • Comfortable with writing Regular Expressions, or willingness to dive fairly deep into them
  • Comfortable running Python and Rust scripts on your machine

See below for the specific flow.

Languages exported

The following languages have already been exported:

  • ca
  • cs
  • de
  • en
  • eo
  • es
  • fr
  • it
  • ka
  • zh-CN

Exported sentences can be found in their locale subfolder in the voice-web repository.

Currently we can’t re-run exports for these languages. One idea to look into is to re-run it for articles that were created after the export date.

Languages with open PRs

The following languages are currently being worked on, or have had some work done and need attention:

General flow

  • A new rules file is created as a Pull Request against the Sentence Extractor repository. The configuration options are described in the README. You can also take the English rules file as a base and adjust it to your language (note that the English file does not use all possible rule options).
  • The GitHub Action automatically extracts a sample which can be used to verify the rules file and find invalid sentences. More info can be found in its own Discourse topic.
  • Improve the rules file based on the output above
  • In many cases using a blocklist has drastically improved the sentence quality. See the README for an explanation of how to create a blocklist. This takes quite some time and internet bandwidth, as it currently needs to be run locally on your machine. I’ve filed an issue to improve that step.
  • Once the blocklist and rules files are done and the sample output looks good, send the sample output to others for verification (see the next section)
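To illustrate what a rules file does conceptually, here is a minimal Python sketch of rule-based sentence filtering. The option names and values below are illustrative assumptions; the real per-language rules files live in the Sentence Extractor repository and the exact options are documented in its README.

```python
import re

# Illustrative rule set, loosely modeled on the kinds of options a rules
# file contains (word count limits, disallowed symbols, etc.). The exact
# option names in the real TOML rules files may differ.
RULES = {
    "min_word_count": 3,
    "max_word_count": 14,
    "disallowed_symbols": ["<", ">", "+", "*", "\\", "#", "@"],
    "ends_with_punctuation": re.compile(r"[.!?]$"),
}

def passes_rules(sentence: str, rules=RULES) -> bool:
    """Return True if the sentence passes every filter rule."""
    words = sentence.split()
    if not rules["min_word_count"] <= len(words) <= rules["max_word_count"]:
        return False
    if any(sym in sentence for sym in rules["disallowed_symbols"]):
        return False
    return bool(rules["ends_with_punctuation"].search(sentence))

print(passes_rules("This is a readable sentence."))           # True
print(passes_rules("Short."))                                  # False: too few words
print(passes_rules("Contains a <tag> and is rejected here."))  # False: bad symbol
```

Tightening rules like these (and iterating on the sample output) is what most of the per-language work amounts to.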

You can check an existing Pull Request to see an example of a full workflow.

Approval criteria/questions

  • How many sentences did you get at the end?
  • How did you create the blocklist?
  • Get at least 3 different native speakers (ideally linguists) to review a random sample of 100-500 sentences and estimate the average error ratio, and comment (or link their comment) in the PR.

Integration workflow

Once the Pull Request gets merged, we tag it with a special flag, which will trigger the automatic export. Please note that any official export needs to be done this way; we do not allow self-exported files to be added to Common Voice, to make sure that the legal requirements are fulfilled. This means that any changes need to be made to the rules and blocklist files; there can’t be any manual cleanup of the resulting file. If cleanup is needed, you can do it once the file is merged into the Common Voice repository.

When the automatic full extraction is done, we will add a Pull Request to the voice-web repo adding the new sentences.

Adding new data sources

Even though the script currently only supports Wikipedia, new data sources can be added. We need to make sure that these data sets have the right license before adding them though. It’s best to create a Discourse post here to discuss it before working on it. Once it’s clear it can be used, refer to the README on how to technically add a new source, or tag me in the relevant Discourse topic and I’ll see if I can help out getting it integrated into the script.

Getting support

If you have any questions or need help, please create a new topic here on Discourse or jump into our dedicated Matrix Channel. We’re happy to help out!


I have added two manual triggers to the GitHub Actions.

Run Blocklist Generation

Until now you had to run a full extraction locally to generate the blocklist. Now this can be done through a comment on any issue. If you already have a PR open, use that. If not, create your own issue, so we don’t spam unrelated issues.

To trigger the job, add the following line in a new comment:

/action blocklist [language code] [max occurrences of words]

For example: /action blocklist en 80

The job will then post a link to the GitHub Action run where you will find the resulting files (artifacts).
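Conceptually, blocklist generation is frequency-based: words that occur rarely across the full extraction are likely typos, rare proper nouns, or otherwise hard to read, so they get blocked. The sketch below is a simplified Python illustration of that idea; my reading of the threshold parameter (words occurring at most N times get blocked) and the whitespace tokenization are assumptions, and the real tool’s counting may differ.

```python
from collections import Counter

def build_blocklist(sentences, max_occurrences):
    """Collect words appearing at most `max_occurrences` times.

    Rare words are likely misspellings or obscure terms, so they become
    blocklist candidates. Simplified illustration, not the tool's exact logic.
    """
    counts = Counter(word.lower() for s in sentences for word in s.split())
    return sorted(w for w, n in counts.items() if n <= max_occurrences)

corpus = ["the cat sat", "the dog sat", "the xqzv sat"]
print(build_blocklist(corpus, max_occurrences=1))  # ['cat', 'dog', 'xqzv']
```

Sentences containing a blocklisted word are then skipped during extraction.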

You can see an example here:

Trigger a full extraction

In addition to the full extraction that can run when a PR gets merged, the following comment format will trigger a full extraction as well. Usually the sample extraction created when pushing new commits to a PR should be enough; however, if you need more sentences to verify your rules than the sample extraction provides, you can use this method.

/action extract [language code]

Example: /action extract en


> Currently we can’t re-run exports for these languages. One idea to look into is to re-run it for articles that were created after the export date.

Hi Michael,

I was wondering how that plays out with the single sentence record limit feature that was announced. There’s some math here that suggests that a language needs 2000+ hours of recordings, which at a rate of 4 seconds/clip is at least 1.8M sentences that now have to be unique.

Am I thinking about this problem correctly? If so, it means that this tool will become much more critical.

I was looking at Spanish as an example, and I see that there are 10836 sentences here. Are those all the sentences that were added to the database using the sentence-extractor? Is there a way to see how many sentences are there in the database total (including those that were manually added using the sentence-collector)?

I see the export date for Spanish is 2020-06-11. When you said:

One idea to look into is to re-run it for articles that were created after the export date.

Does that mean that we can’t extract any more sentences from Wikipedia articles that were created before that date?

I saw there’s a cap on 3 sentences per article (here). Assuming there are 30k sentences in the Spanish database (10k submitted through the extractor + 20k? through the manual collector), that means that we still need 1.77M (1.8M - 30k) sentences. At 3 per article, that’s 590k articles. I found this, and it looks like Spanish Wikipedia added 100k new articles from Dec2017 to Dec2018 (latest data available there). We would need 5+ years of new articles to have enough data lol.
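As a sanity check, the arithmetic above works out as follows (all inputs are the assumptions stated in this post, including the guessed 30k existing sentences):

```python
# Back-of-the-envelope check of the numbers above.
hours_needed = 2000
seconds_per_clip = 4
clips_needed = hours_needed * 3600 // seconds_per_clip
print(clips_needed)  # 1800000 unique sentences needed

existing_sentences = 30_000          # assumed: extractor + collector combined
remaining = clips_needed - existing_sentences
articles_needed = remaining // 3     # at most 3 sentences per article
print(remaining, articles_needed)    # 1770000 590000

new_articles_per_year = 100_000      # es-wiki growth, Dec 2017 - Dec 2018
print(articles_needed / new_articles_per_year)  # 5.9 years of new articles
```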

Sorry for all the questions. I really like the idea of the project. I’ve been using Rust at work for a while now and figured I could contribute here. I’m trying to understand where it fits in the broader project, and whether it’s something worth investing more time in.


The file you’re referencing (sentence-collector.txt - 10836 sentences) is the export from the Sentence Collector. Those are the manually uploaded sentences. That’s also what the date 2020-06-11 refers to.

The Wikipedia extract (done through the Sentence Extractor - see this PR) you can find in the file. It contains 1,172,326 sentences. That import was done on July 19th 2019, so we could only use articles created in roughly the last year if we re-did the export.

I’d say it definitely is, as it allows way more sentences to be extracted, and in theory it would also support other data sources. However, not all languages have a lot of articles on Wikipedia, which is why the Sentence Collector is helpful too. There is one issue that is the most important for the Sentence Extractor to look into, if you want to have a look and leave any feedback on it.

Awesome, it’s not as bad as I thought then haha. Thanks for your answer!

I’ll take a look at that issue, although it’s not within my domain at all. I may start by tackling some of the other open issues that seem easier, at least to get my feet wet in the codebase.