Sentence Extraction now automated

To extract sentences from a data source such as Wikipedia (at most three sentences per article), we have the Sentence Extractor script. It can extract sentences from Wikipedia and, in theory, other compatible data sources (see details about data sources).

Until recently, the extraction had to be run manually once the rules were approved. This took quite some time: somebody had to run the script on their own computer, so it had to be prioritized against other open tasks. Since yesterday this is mostly automated. Whenever we approve a rule set and merge the Pull Request, we can mark it so that it automatically starts a job to run the full script. All we need to do manually (for now? ;)) is download the resulting file and create a Pull Request to include it in the Common Voice repository. This lets us bring the output of the work you have all been doing on creating rules to the Common Voice website sooner than before, which benefits everyone.

If you would like to contribute rule sets for a language that is not covered yet, check the discussion here:

You can find the languages which are being worked on here: https://github.com/Common-Voice/cv-sentence-extractor/pulls

Feel free to ask any questions here or in the linked topic.

Michael

Thanks for this Michael.

I would like to point out how important this improvement is: lowering the technical barrier to extracting large numbers of sentences from different sources is fundamental to speeding up the process and ensuring our languages have enough new sentences to record on the site.

Thanks also to everyone who has been extracting and helping with the script in the past months; your feedback has been fundamental to its continued improvement :slight_smile:

Very cool, thanks so much for your leadership on this :sparkling_heart:

This is great and a big step forward.

One thought, now that we have an easy way to automate the extraction: many languages are gaining tens or even hundreds of thousands of new articles on Wikipedia every year. Maybe we could find a way to extract sentences only from articles that are newer than the last extraction date? This could ensure a steady addition of new sentences to many languages over the years.

Absolutely, I’ve already looked into that. We would need to change the WikiExtractor to provide us with the creation date of the article though. Currently the output from WikiExtractor does not include that. Will look into that when I get time.
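
To make the idea concrete, here is a rough Python sketch of what date-based filtering could look like. This is not how the Sentence Extractor or WikiExtractor works today: the `Article` record, its `created` field, and the function name are all hypothetical, and it assumes the creation date has somehow been made available. It also just takes the first sentences of each article, while the real script applies the language rules when choosing sentences.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Article:
    # Hypothetical record: WikiExtractor output does not currently
    # include the creation date of an article.
    title: str
    created: date
    sentences: List[str]


# Limit mentioned in the announcement: three sentences per article maximum.
MAX_SENTENCES_PER_ARTICLE = 3


def sentences_from_new_articles(articles, last_extraction):
    """Collect up to three sentences from each article created after last_extraction."""
    picked = []
    for article in articles:
        if article.created > last_extraction:
            picked.extend(article.sentences[:MAX_SENTENCES_PER_ARTICLE])
    return picked


articles = [
    Article("Old article", date(2019, 5, 1), ["Old one.", "Old two."]),
    Article("New article", date(2020, 8, 15),
            ["New one.", "New two.", "New three.", "New four."]),
]
result = sentences_from_new_articles(articles, last_extraction=date(2020, 1, 1))
print(result)
```

With a filter like this, a repeated run would only need to remember the date of the previous extraction and could skip every article that was already covered.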
