Using the Europarl Dataset with sentences from speeches from the European Parliament

mkohler · July 3, 2020, 4:33pm

Extracting all languages might get tricky, but for languages that already have a rules file for the Sentence Extractor I can see this working quite nicely. As the Sentence Extractor in theory supports adding a new source, I’d say we should go down that road. More info on that here: https://github.com/Common-Voice/cv-sentence-extractor#adding-another-scrape-target

I haven’t looked yet at the data source structure, but if it’s straightforward, I’d be fine with adding the fetch and prepare code in the sentence-extractor as well, so that everything is in one place.
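Purely as an illustration of what a "prepare" step could look like, here is a minimal sketch in Python. It assumes, unverified, that the raw files are plain text with one paragraph per line, interleaved with markup-only lines that start with `<`. The function name `prepare_lines` and the sample lines are made up for this example and are not part of the Sentence Extractor.

```python
import re
from typing import Iterable, Iterator

# Assumption: markup lines consist of a single tag and nothing else.
MARKUP_LINE = re.compile(r"^\s*<[^>]*>\s*$")


def prepare_lines(raw_lines: Iterable[str]) -> Iterator[str]:
    """Yield text lines, skipping blank lines and markup-only lines."""
    for line in raw_lines:
        text = line.strip()
        if not text or MARKUP_LINE.match(text):
            continue
        yield text


# Hypothetical sample input, not taken from the real dataset.
sample = [
    "<CHAPTER ID=1>",
    "Resumption of the session",
    '<SPEAKER ID=1 NAME="President">',
    "",
    "I declare resumed the session of the European Parliament.",
]

prepared = list(prepare_lines(sample))
```

The output of a step like this would still need to go through the language's rules file before any sentence is accepted.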

The other question then is how to trigger the extraction job. For now it only runs on merges, so we don’t have a trigger for anything that doesn’t come with a PR. I’ll think about that.