Sentence Extraction now automated

mkohler · March 17, 2020, 9:10pm

To extract sentences from a data source such as Wikipedia (three sentences per article maximum) we have the Sentence Extractor script. This can extract sentences from Wikipedia and - in theory - other compatible data sources (see details about data sources).
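To make the "three sentences per article maximum" idea concrete, here is a minimal sketch in Python. It is not the actual Sentence Extractor code: the function names, the naive sentence splitter, and the `is_valid` rule callback are all assumptions made for illustration. How the real script decides which sentences to keep is not described in this post, so the sketch simply keeps the first valid ones.

```python
import re

# Cap taken from the post: at most three sentences per article.
MAX_SENTENCES_PER_ARTICLE = 3


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def pick_sentences(article_text, is_valid, limit=MAX_SENTENCES_PER_ARTICLE):
    """Return at most `limit` sentences from one article that pass `is_valid`.

    `is_valid` stands in for a language's rule set (hypothetical callback).
    """
    candidates = [s for s in split_sentences(article_text) if is_valid(s)]
    return candidates[:limit]
```

For example, with a hypothetical rule that rejects sentences containing digits, `pick_sentences(text, lambda s: not any(c.isdigit() for c in s))` returns no more than three digit-free sentences from `text`.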

Until recently, the extraction had to be run manually once the rules were approved. This took quite some time, as somebody had to run the script on their own computer, so it had to be prioritized against other open tasks. Since yesterday this is mostly automated. Whenever we approve a rule set and merge the Pull Request, we can mark it so that it automatically starts a job to run the full script. All we need to do manually (for now? ;)) is download the resulting file and create a Pull Request to include it in the Common Voice repository. This lets us bring the results of the work you have all been doing on creating rules to the Common Voice website sooner than before, which benefits everyone.

If you would like to contribute rule sets for a language that is not covered yet, check the discussion here:

You can find the languages which are being worked on here: Pull requests · common-voice/cv-sentence-extractor · GitHub

Feel free to ask any questions here or in the linked topic.

Michael

nukeador · March 18, 2020, 11:41am

Thanks for this Michael.

I would like to point out how important this improvement is. Lowering the technical barrier to extracting large numbers of sentences from different sources is fundamental to speeding up the process and ensuring our languages have enough new sentences available to record on the site.

Thanks also to everyone who has been extracting and helping with the script over the past months. Your feedback has been fundamental to keep improving it.

phirework · March 18, 2020, 3:55pm

Very cool, thanks so much for your leadership on this.

stergro · March 19, 2020, 6:32pm

This is great and a big step forward.

One thought, now that we have an easy way to automate the extraction: many languages are gaining tens or even hundreds of thousands of new articles on Wikipedia every year. Maybe we could find a way to extract sentences only from articles that are newer than the last extraction date? This could ensure a steady addition of new sentences in many languages over the years.

mkohler · March 19, 2020, 6:45pm

stergro wrote: "One thought, now that we have an easy way to automate the extraction: many languages are gaining tens or even hundreds of thousands of new articles on Wikipedia every year. Maybe we could find a way to extract sentences only from articles that are newer than the last extraction date? This could ensure a steady addition of new sentences in many languages over the years."

Absolutely, I’ve already looked into that. We would need to change the WikiExtractor to provide us with the creation date of the article though. Currently the output from WikiExtractor does not include that. Will look into that when I get time.
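The incremental idea discussed here can be sketched as follows, assuming the creation date were available for each article. As noted above, the WikiExtractor output does not currently include it, so the `Article` type and its `created_at` field are hypothetical and only illustrate the filtering step, not any existing code.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Article:
    title: str
    created_at: datetime  # hypothetical field: not in the current WikiExtractor output
    text: str


def articles_since(articles, last_extraction):
    """Keep only articles created strictly after the last extraction run."""
    return [a for a in articles if a.created_at > last_extraction]


# Usage sketch with made-up data.
last_run = datetime(2020, 3, 16, tzinfo=timezone.utc)
sample = [
    Article("Old article", datetime(2019, 5, 1, tzinfo=timezone.utc), "..."),
    Article("New article", datetime(2020, 3, 18, tzinfo=timezone.utc), "..."),
]
new_articles = articles_since(sample, last_run)
```

Only `new_articles` would then be passed on to sentence extraction, so each run adds sentences from articles that did not exist at the previous run.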
