Re-running extractions from Wikipedia since last extraction

Hi everyone

I have implemented the necessary steps to allow re-running extractions from Wikipedia through the Sentence Extractor. These steps make sure that only articles created since the last extraction are considered, so that we can still fulfill the legal requirements.
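To illustrate the idea only (this is not the Sentence Extractor's actual code), here is a minimal Python sketch of limiting a re-run to articles created after the previous extraction. The `articles_since` helper, the `created` field, and the dates are all made up for the example.

```python
from datetime import date

def articles_since(articles, last_extraction):
    """Keep only articles created strictly after the last extraction date."""
    return [a for a in articles if a["created"] > last_extraction]

# Hypothetical sample data; a real run would read this from a Wikipedia dump.
articles = [
    {"title": "Old article", "created": date(2019, 5, 1)},
    {"title": "New article", "created": date(2021, 3, 15)},
]

new_articles = articles_since(articles, last_extraction=date(2020, 1, 1))
print([a["title"] for a in new_articles])
```

Articles that already existed at the first run are skipped entirely, so nothing from them can be extracted a second time.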

So far I’ve created a PR for Italian, but I’d be happy to run it for other languages that already have a rules file as well. Just write here if you would like me to do that and if you would be willing to do a quick review of the result before the PR is created.

Michael


This is great news. Thanks for your work! That’s almost 50 000 new sentences for Italian.

I would love to do a re-run for German and Esperanto. But for both languages I would like to improve the rules file first, because it wasn’t possible to define an alphabet (an allowlist of letters and symbols) when the first run happened for these languages. It would also be worth thinking about adding a few more abbreviations to the German rules.
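To make the two ideas above concrete, here is a rough Python sketch of what an allowlist of letters and symbols plus an abbreviation filter could look like. It only illustrates the concept: the real rules files have their own format, and the character set, the abbreviation list, and the `is_acceptable` helper below are assumptions made up for this example.

```python
import re

# Hypothetical allowlist: basic Latin letters, German umlauts and ß,
# spaces, and a handful of punctuation marks. Anything else rejects the sentence.
ALLOWED = re.compile(r"[A-Za-zÄÖÜäöüß ,.?!'\-]+")

# Hypothetical abbreviation patterns; sentences containing them are dropped
# because abbreviations are hard to read aloud consistently.
ABBREVIATIONS = re.compile(r"\b(?:z\.\s?B\.|usw\.|bzw\.)")

def is_acceptable(sentence):
    """Return True if the sentence uses only allowed symbols and no abbreviations."""
    if ALLOWED.fullmatch(sentence) is None:
        return False
    return ABBREVIATIONS.search(sentence) is None

for s in ["Das ist ein schöner Tag.", "Das kostet 5 Euro.", "Das ist z. B. gut."]:
    print(s, is_acceptable(s))
```

The advantage of an allowlist over a list of disallowed symbols is that unexpected characters (digits, foreign scripts, stray markup) are rejected by default instead of having to be enumerated one by one.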

Have you considered yearly re-runs for all languages with a rules file? This would ensure steady growth of the sentence corpus for many languages.

Yes, please! I’d be happy to help validate the German rule changes based on the sample extraction.

As this is all automated, this could be run whenever. The only manual work is to download the file from GitHub and create a PR in the Common Voice repository. And even that part, in theory, could be automated quite well.


Great :slight_smile: I was wrong, the German file already uses the new allowed symbols regex. But I did a few other fixes and sent a pull request.


Michael and I now want to improve the English rules file, to create a showcase for best practices. Right now, the English file is one of the oldest and does not use the new features of the tool.

The discussion about this happens here: