Scraper - Automatic sample sentences extracted in Pull Request

Hi everyone

I’ve just merged a change to the infrastructure of the Common Voice Scraper. For every Pull Request, about 5000 sentences are now extracted automatically so that you can validate your code changes. Additionally, some automated tests are run to check that your changes won’t break anything.

Edit - 2021-04-24: Note that GitHub does not automatically run the pipeline if you are a first-time contributor. If your sample extraction doesn’t get approved within a day, please reach out to us on Matrix.

In the Pull Request it looks like this:

The extraction takes between 5 and 10 minutes. Once it’s done, GitHub will tell you with a green mark:

To find the extracted sentences, click on “Show all checks” on the right side and click on the “Detail” link for the “Sample Extraction / extract” row. This will lead you to

where you can inspect the logs if you want, or more importantly, download the extracted sentences through the “Artifacts” dropdown in the top right corner. This downloads a zip file containing one text file with all the extracted sample sentences. Note that these come from Wikipedia, so please do not upload them to the database. The team will do that with a full extraction that collects all the sentences we can, not just the sample. However, with this sample you can validate your rules and adjust them further.
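If you’d rather not page through the whole text file by hand, here is a minimal sketch for pulling a random handful of sentences out of the downloaded zip. It is not part of the scraper. The function names `read_sample_sentences` and `pick_for_review` are made up for this example, and it assumes the text file is UTF-8 encoded and is the first file in the archive.

```python
import random
import sys
import zipfile


def read_sample_sentences(zip_path):
    """Return the non-empty lines of the first file in the artifact zip."""
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: the archive holds a single text file, so the first
        # non-directory entry is used and no file name is hard-coded.
        entries = [info for info in archive.infolist() if not info.is_dir()]
        if not entries:
            raise ValueError(f"no file found in {zip_path}")
        # Assumption: the sentences are stored as UTF-8, one per line.
        content = archive.read(entries[0]).decode("utf-8")
    return [line.strip() for line in content.splitlines() if line.strip()]


def pick_for_review(sentences, count=20, seed=None):
    """Pick up to `count` sentences at random for a manual read-through."""
    rng = random.Random(seed)
    return rng.sample(sentences, min(count, len(sentences)))


# Only runs when a zip path is passed on the command line.
if __name__ == "__main__" and len(sys.argv) == 2:
    sentences = read_sample_sentences(sys.argv[1])
    print(f"{len(sentences)} sentences in sample")
    for sentence in pick_for_review(sentences):
        print(sentence)
```

Save it under any name you like and pass the path of the downloaded zip as the only argument. Reading a random subset tends to surface rule problems faster than reading the file from the top.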

For the sample extraction to work, make sure that

  • the rules file name is the ISO code, such as en.toml or hi.toml
  • the blacklist file follows the same naming, such as en.txt or hi.txt
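As a quick sanity check before opening a Pull Request, the naming convention above could be verified with a few lines like these. This is only a sketch: `check_language_files` is a hypothetical helper, not something the scraper provides, and it assumes the blacklist uses the .txt extension as in the en.txt example. It only compares file names and does not check that the code itself is a valid ISO code.

```python
from pathlib import PurePath


def check_language_files(rules_file, blacklist_file):
    """Return a list of naming problems; an empty list means the names line up."""
    problems = []
    rules = PurePath(rules_file)
    blacklist = PurePath(blacklist_file)

    if rules.suffix != ".toml":
        problems.append(f"rules file should end in .toml: {rules.name}")
    # Assumption: blacklist files use .txt, as in the en.txt example.
    if blacklist.suffix != ".txt":
        problems.append(f"blacklist file should end in .txt: {blacklist.name}")
    # Both files should be named after the same language code.
    if rules.stem != blacklist.stem:
        problems.append(
            f"language codes differ: {rules.stem!r} vs {blacklist.stem!r}"
        )
    return problems


# Example with the file names used above.
for problem in check_language_files("en.toml", "en.txt"):
    print(problem)
```

For a matching pair such as en.toml and en.txt the list is empty and nothing is printed.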

Also note that when you run the extraction locally, the language parameter now needs to be the ISO code, passing en instead of english and so forth:

cargo run -- extract -l en -d path/to/files

If you’d like to contribute another language, please head over to [Technical feedback needed] Wikipedia extractor script beta to learn more. I’m also happy to answer any questions over at chat.mozilla.org in the #common-voice-scraper:mozilla.org room.

Also feel free to ask any questions regarding this sample extraction here.

Thanks!
Michael

Thanks for this @mkohler, this is a fundamental step towards making content extraction accessible to non-technical contributors!