Re-running extractions from Wikipedia since last extraction

mkohler · October 11, 2021, 3:57pm

Hi everyone

I have implemented the necessary steps to allow re-running extractions from Wikipedia through the Sentence Extractor. These steps make sure that only articles created since the last extraction are considered, so that we can still fulfill the legal requirements.
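To illustrate the idea, here is a minimal sketch of a "created since the last extraction" filter. It is not the Sentence Extractor's actual code: the `articles_since` helper, the `created` field, and the dates are all hypothetical and only show the filtering step described above.

```python
from datetime import date


def articles_since(articles, last_extraction):
    """Return only the articles created after the previous extraction date.

    `articles` is a list of dicts with a hypothetical `created` date field.
    """
    return [a for a in articles if a["created"] > last_extraction]


# Hypothetical sample data, for illustration only.
articles = [
    {"title": "Old article", "created": date(2019, 5, 1)},
    {"title": "New article", "created": date(2021, 3, 15)},
]

# Assume the previous extraction ran on 2020-01-01.
new_articles = articles_since(articles, date(2020, 1, 1))
print([a["title"] for a in new_articles])
```

Articles that already existed at the time of the previous run are skipped, so a re-run only draws sentences from articles that were not available before.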

So far I’ve created a PR for Italian, but I’d be happy to run it for other languages that already have a rules file as well. Just write here if you would like me to do that and if you would be willing to do a quick review on it before the PR is created.

Michael

stergro · October 12, 2021, 12:32pm

This is great news. Thanks for your work! That’s almost 50,000 new sentences for Italian.

I would love to do a re-run for German and Esperanto. For both languages, though, I would like to improve the rules file first, because it wasn’t possible to define an alphabet (an allowlist of letters and symbols) when the first run happened for these languages. It would also be worth thinking about adding a few more abbreviations to the German rules.

Have you considered yearly re-runs for all languages with a rules file? This would ensure steady growth of the sentence corpus for many languages.

mkohler · October 12, 2021, 4:15pm

Yes, please! I’d be happy to help validate the German rule changes based on the sample extraction.

As this is all automated, it could be run at any time. The only manual work is to download the file from GitHub and create a PR in the Common Voice repository. And even that part could, in theory, be automated quite well.

stergro · October 12, 2021, 7:55pm

Great, I was wrong: the German file already uses the new allowed symbols regex. But I made a few other fixes and sent a pull request.

stergro · October 14, 2021, 5:41pm

Michael and I now want to improve the English rules file to create a showcase for best practices. Right now, the English file is one of the oldest rules files and does not use the newer features of the tool.

The discussion about this happens here:

github.com/common-voice/cv-sentence-extractor

Use best practice in English rules file

opened 11:20PM - 11 Oct 21 UTC

closed 03:30PM - 20 Nov 21 UTC

MichaelKohler

enhancement P1 rules

Over the past years there were quite a few changes to the possible rule configurations. The English rules file still uses some older rules that should now be replaced by better approaches, such as replacing `disallowed_symbols` with `allowed_symbols`. As most new rule files get copied from EN, this would improve overall quality quite substantially.
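As a rough illustration of why an allowlist tends to be stricter than a denylist, here is a small sketch. The symbol sets and helper names below are made up for the example and are not taken from any actual rules file or from the extractor's code.

```python
import re

# Hypothetical denylist: reject a sentence if it contains any of these symbols.
DISALLOWED_SYMBOLS = {"<", ">", "+", "*", "#", "@"}

# Hypothetical allowlist: accept a sentence only if every character matches.
ALLOWED_SYMBOLS = re.compile(r"[A-Za-z ,.'?!-]+")


def passes_denylist(sentence):
    """True if the sentence contains none of the listed symbols."""
    return not any(ch in DISALLOWED_SYMBOLS for ch in sentence)


def passes_allowlist(sentence):
    """True if the sentence consists only of allowed characters."""
    return ALLOWED_SYMBOLS.fullmatch(sentence) is not None


plain = "This is a plain sentence."
tricky = "The price was 5€ back then."

print(passes_denylist(plain), passes_allowlist(plain))
print(passes_denylist(tricky), passes_allowlist(tricky))
```

The second sentence slips through the denylist because nobody thought to list `€` or digits, while the allowlist rejects it by default. That is the main argument for defining the allowed alphabet instead of enumerating everything unwanted.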
