Scraper - Automatic sample sentences extracted in Pull Request

mkohler · March 4, 2020, 8:58pm

Hi everyone

I’ve just merged a change to the infrastructure of the Common Voice Scraper. For every Pull Request, about 5000 sentences are now extracted automatically so that you can validate your code changes. Additionally, some automated tests are run to check that your changes won’t break anything.

Edit - 2021-04-24: Note that GitHub does not automatically run the pipeline if you are a first-time contributor. If your sample extraction doesn’t get approved within a day, please reach out to us on Matrix.

This will look like this:

The extraction takes between 5 and 10 minutes. Once it’s done, GitHub will tell you with a green mark:

To find the extracted sentences, click on “Show all checks” on the right side and then on the “Details” link in the “Sample Extraction / extract” row. This leads you to the page for that check, where you can inspect the logs if you want or, more importantly, download the extracted sentences through the “Artifacts” dropdown in the top right corner. The download is a zip file containing one text file with all the extracted sample sentences.

Note that these come from Wikipedia, so please do not upload them to the database yourself. The team will do that with a full extraction that collects all the sentences we can, not just the sample. However, you can use this sample to validate your rules and adjust them further.
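If you want a quick overview of the downloaded sample before reading through it, something like the following Python sketch could help. It is not part of the scraper. It assumes the zip contains a single UTF-8 text file with one sentence per line (the one-per-line layout is my assumption), and the function names are made up for this example.

```python
import random
import sys
import zipfile


def read_sentences(zip_file):
    """Return the non-empty lines of the first file in the archive.

    zip_file may be a path or a binary file-like object.
    Assumes a single UTF-8 text file with one sentence per line.
    """
    with zipfile.ZipFile(zip_file) as archive:
        names = archive.namelist()
        if not names:
            raise ValueError("the archive is empty")
        text = archive.read(names[0]).decode("utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def summarize(sentences, sample_size=10, seed=0):
    """Return basic counts plus a reproducible random sample for manual review."""
    rng = random.Random(seed)
    picked = rng.sample(sentences, min(sample_size, len(sentences)))
    word_counts = [len(sentence.split()) for sentence in sentences]
    return {
        "total": len(sentences),
        "unique": len(set(sentences)),
        "max_words": max(word_counts, default=0),
        "sample": picked,
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python inspect_sample.py path/to/downloaded.zip
    stats = summarize(read_sentences(sys.argv[1]))
    print(f"{stats['total']} sentences, {stats['unique']} unique, "
          f"longest has {stats['max_words']} words")
    for sentence in stats["sample"]:
        print("-", sentence)
```

The random sample is seeded so that you see the same sentences again after re-running, which makes it easier to compare the output before and after a rule change.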

For this to work, make sure that

the rules file name is the ISO code, such as en.toml or hi.toml
the blacklist file is named the same way, such as en.txt or hi.txt
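To catch a naming mistake before opening the Pull Request, a small check along these lines might be useful. This is only an illustrative sketch and not something the pipeline provides: it assumes language codes are 2 or 3 lowercase letters, that the rules file is a .toml file and the blacklist a .txt file (as in en.toml and en.txt), and the function name is hypothetical.

```python
import re
from pathlib import Path

# Assumption for this sketch: language codes are 2 or 3 lowercase letters.
CODE_PATTERN = re.compile(r"^[a-z]{2,3}$")


def check_language_files(rules_file, blacklist_file):
    """Return a list of naming problems for the two files (empty if none)."""
    rules, blacklist = Path(rules_file), Path(blacklist_file)
    problems = []
    if rules.suffix != ".toml":
        problems.append(f"rules file should end in .toml: {rules.name}")
    if blacklist.suffix != ".txt":
        problems.append(f"blacklist file should end in .txt: {blacklist.name}")
    if not CODE_PATTERN.match(rules.stem):
        problems.append(f"rules file name is not a language code: {rules.stem}")
    if rules.stem != blacklist.stem:
        problems.append(f"language codes differ: {rules.stem} vs {blacklist.stem}")
    return problems


if __name__ == "__main__":
    # Only the file names are inspected; the files do not need to exist.
    for message in check_language_files("en.toml", "en.txt"):
        print(message)
```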

Also note that when you run the extraction locally, the language parameter now takes the ISO code, so pass en instead of english, and so forth:

cargo run -- extract -l en -d path/to/files

If you’d like to contribute another language, please head over to [Technical feedback needed] Wikipedia extractor script beta to learn more. I’m also happy to answer any questions over at chat.mozilla.org in the #common-voice-scraper:mozilla.org room.

Also feel free to ask any questions regarding this sample extraction here.

Thanks!
Michael
