Adding CoVoST sentences to Common Voice


Facebook's CoVoST project had human translators translate Common Voice sentences from several languages into English, and vice versa.

For instance, Catalan sentences from the Common Voice dataset were translated into English, and English sentences from Common Voice were translated into Catalan!

So I wonder if we can import 300,000 new unique Catalan sentences from CoVoST2 (CC0-licensed) into Common Voice.

Other languages could reuse CoVoST sentences too.

If it’s CC0, I think it is fine. The only catch will be reviewing this many sentences, but on the technical side, just push them into the Sentence Collector and go for it :slight_smile:

Putting 300k sentences into Sentence Collector might not really be efficient though. That’s gonna take forever to review. For the Europarl dataset, we’ve come up with a way to review a certain percentage and if that’s ok, we’d take the full dataset. This could then be added directly instead of going through the Sentence Collector. Maybe we could do something similar here? This would also make it easier to do for multiple languages and not just Catalan.
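As a rough illustration of that kind of sample-based review, here is a minimal sketch. The 1% sample size and the 5% error threshold are placeholder assumptions, not the values used for Europarl, and both function names are made up for this example.

```python
import random


def draw_review_sample(sentences, fraction=0.01, seed=0):
    """Pick a reproducible random subset of sentences for manual review.

    `fraction` and `seed` are illustrative defaults (assumptions).
    """
    if not sentences:
        return []
    k = min(len(sentences), max(1, round(len(sentences) * fraction)))
    return random.Random(seed).sample(sentences, k)


def accept_dataset(verdicts, max_error_rate=0.05):
    """Decide whether to import the full dataset.

    `verdicts` holds one boolean per reviewed sentence (True = approved).
    The threshold is a placeholder assumption.
    """
    if not verdicts:
        return False
    errors = sum(1 for ok in verdicts if not ok)
    return errors / len(verdicts) <= max_error_rate
```

Reviewers would only look at the output of `draw_review_sample`, and the whole file would be imported directly if `accept_dataset` returns True for their verdicts.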

Thanks, @mkohler. I was thinking of a direct import, like the Wikipedia sentences. And yes, every language supported by CoVoST could import its sentences :slight_smile:

I parsed the CV 6.0 dataset, released yesterday. Catalan will soon run out of sentences. So, please, can anyone help us import the CoVoST2 sentences into Common Voice? Thanks in advance.

Hi, I am the author of CoVoST and I support this proposal. We would like voice recordings to be collected for the CoVoST translations as well, since that would enable a new application: speech-to-speech translation. This would extend the scope of Common Voice to human-to-human interaction without language barriers. Please let me know if you need any support with the CoVoST data.

I made a PR with the Catalan CoVoST2 sentences.

I just parsed the Catalan CoVoST2 sentences to normalize and deduplicate them (there are many repeated sentences, and different apostrophe characters are used).

The sentences were translated by humans, and their quality is good enough.
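For anyone who wants to do the same for another language, a minimal sketch of such a cleanup might look like this. It is not the script used for the PR; the list of apostrophe variants and the choice of the straight apostrophe as the target are assumptions.

```python
import unicodedata

# Apostrophe-like characters to unify (an assumed list, extend as needed):
# right/left single quotes, modifier letter apostrophe, grave and acute accents.
APOSTROPHE_VARIANTS = "\u2019\u2018\u02BC\u0060\u00B4"
_APOSTROPHE_TABLE = str.maketrans({ch: "'" for ch in APOSTROPHE_VARIANTS})


def normalize(sentence):
    """Apply NFC, unify apostrophes, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", sentence)
    text = text.translate(_APOSTROPHE_TABLE)
    return " ".join(text.split())


def unique_sentences(sentences):
    """Return normalized sentences without repeats, keeping first-seen order."""
    seen = set()
    result = []
    for raw in sentences:
        text = normalize(raw)
        if text and text not in seen:
            seen.add(text)
            result.append(text)
    return result
```

Normalizing before comparing matters here, because two copies of the same sentence that differ only in the apostrophe character would otherwise both survive.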

Looks great! Yeah, the repeats are expected: when the same sentence appears more than once in validated.tsv (recorded by different speakers), we duplicate its translation accordingly.
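To see how often that happens, a quick count per sentence is enough. This is only a sketch: it assumes a tab-separated file with a `sentence` column, and the three inline rows are made-up sample data rather than real dataset content.

```python
import csv
import io
from collections import Counter


def sentence_counts(tsv_file):
    """Count how many rows share each sentence in a TSV with a `sentence` column."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row["sentence"] for row in reader)


# Made-up rows standing in for a real validated.tsv.
sample = io.StringIO(
    "client_id\tpath\tsentence\n"
    "a\t1.mp3\tBon dia.\n"
    "b\t2.mp3\tBon dia.\n"
    "c\t3.mp3\tAdéu.\n"
)
counts = sentence_counts(sample)
repeated = {text: n for text, n in counts.items() if n > 1}
```

With a real file, you would pass `open(path, encoding="utf-8", newline="")` instead of the `io.StringIO` object.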