To extract sentences from a data source such as Wikipedia (at most three sentences per article), we have the Sentence Extractor script. It can extract sentences from Wikipedia and, in theory, from other compatible data sources (see details about data sources).
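As a rough illustration of the per-article limit, here is a minimal Python sketch. It is not the Sentence Extractor's actual code: the function `pick_sentences`, the rule `short_enough`, and its word-count thresholds are hypothetical names and values made up for this example. It only assumes that an article has already been split into sentences and that a rule set can be expressed as a yes/no check per sentence.

```python
import random

MAX_SENTENCES_PER_ARTICLE = 3  # the limit mentioned above


def pick_sentences(sentences, is_valid, rng=None):
    """Return at most three rule-passing sentences from one article.

    `is_valid` stands in for a language's rule set (hypothetical).
    """
    rng = rng or random.Random()
    # Keep only sentences that pass the rules, then choose a random few.
    candidates = [s for s in sentences if is_valid(s)]
    rng.shuffle(candidates)
    return candidates[:MAX_SENTENCES_PER_ARTICLE]


def short_enough(sentence):
    """Hypothetical rule: keep sentences of 3 to 14 words."""
    return 3 <= len(sentence.split()) <= 14
```

Calling `pick_sentences(article_sentences, short_enough)` on a long article would return no more than three sentences, and fewer if only a few pass the rule.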
Until recently, the extraction had to be run manually once the rules were approved. This took quite some time, because somebody had to run the script on their own computer, so each run had to be prioritized against other open tasks. Since yesterday this is mostly automated. Whenever we approve a rule set and merge the Pull Request, we can mark it so that it automatically starts a job that runs the full script. All we still need to do manually (for now? ;)) is download the resulting file and create a Pull Request to include it in the Common Voice repository. This lets us bring the results of all the work you have been doing on creating rules to the Common Voice website sooner than before, which benefits everyone.
If you would like to contribute rule sets for a language that is not covered yet, check the discussion here:
You can find the languages which are being worked on here: https://github.com/Common-Voice/cv-sentence-extractor/pulls
Feel free to ask any questions here or in the linked topic.