Adding CoVoST sentences to Common Voice


Facebook's CoVoST project had human translators translate Common Voice sentences from several languages into English, and from English into other languages.

For instance, Catalan sentences from the Common Voice dataset were translated into English, and English sentences from Common Voice were translated into Catalan!

So I wonder if we can import the 300,000 new, unique Catalan sentences from CoVoST2 (CC0 licensed) into Common Voice.

Other languages could reuse CoVoST sentences too.
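
To make "new unique sentences" concrete, here is a rough sketch of how the CoVoST sentences could be filtered against what a language already has in Common Voice. The function names and the normalization (Unicode NFC plus whitespace collapsing) are my own assumptions, not something Common Voice or the Sentence Collector prescribes.

```python
import unicodedata


def normalize(sentence):
    """Normalize to NFC and collapse runs of whitespace (assumed rule)."""
    return " ".join(unicodedata.normalize("NFC", sentence).split())


def new_unique_sentences(candidates, existing):
    """Return candidates not already in `existing`, without duplicates.

    Both arguments are iterables of strings. Order of first appearance
    is preserved, and empty lines are dropped.
    """
    seen = {normalize(s) for s in existing}
    result = []
    for sentence in candidates:
        key = normalize(sentence)
        if key and key not in seen:
            seen.add(key)
            result.append(key)
    return result
```

Stricter matching (ignoring case or punctuation, for example) would catch more near-duplicates, but that is a per-language decision.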

If it’s CC0, I think it is fine. The only catch will be reviewing this many sentences, but on the technical side, push them into the Sentence Collector and go for it :slight_smile:

Putting 300k sentences into the Sentence Collector might not really be efficient though. That’s gonna take forever to review. For the Europarl dataset, we came up with a way to review a certain percentage of the sentences, and if that sample is OK, we take the full dataset. The dataset could then be added directly instead of going through the Sentence Collector. Maybe we could do something similar here? This would also make it easier to do for multiple languages and not just Catalan.
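
A minimal sketch of what such a sample-based review could look like. The 1% sample fraction and the 5% error threshold are placeholder assumptions, not the values used for Europarl, and both function names are hypothetical.

```python
import random


def draw_review_sample(sentences, fraction=0.01, seed=0):
    """Pick a reproducible random sample of unique sentences for review.

    `fraction` is an assumed placeholder, not an agreed-upon value.
    """
    unique = sorted({s.strip() for s in sentences if s.strip()})
    if not unique:
        return []
    size = max(1, round(len(unique) * fraction))
    # A seeded generator lets every reviewer see the same sample.
    return random.Random(seed).sample(unique, size)


def accept_dataset(review_results, max_error_rate=0.05):
    """Decide whether the full dataset can be taken.

    `review_results` maps each sampled sentence to True (approved)
    or False (rejected). `max_error_rate` is an assumed threshold.
    """
    if not review_results:
        return False
    rejected = sum(1 for ok in review_results.values() if not ok)
    return rejected / len(review_results) <= max_error_rate
```

For 300k sentences, a 1% sample would mean about 3,000 sentences to review per language instead of the whole set.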