Adding CoVoST sentences to Common Voice

jmontane · October 6, 2020, 10:19am

Hi,

Facebook's CoVoST project had human translators translate Common Voice sentences from several languages into English, and vice versa.

For instance, Catalan sentences from the Common Voice dataset were translated into English, and English sentences from Common Voice were translated into Catalan!

So I wonder whether we can import 300,000 new unique Catalan sentences from CoVoST2 (CC0 licensed) into Common Voice.

Other languages can reuse CoVoST sentences too.

lissyx · October 6, 2020, 10:43am

If it’s CC-0, I think it is fine. The only catch will be reviewing that many sentences, but on the technical side, push them into the Sentence Collector and go for it.

mkohler · October 6, 2020, 7:25pm

Putting 300k sentences into Sentence Collector might not really be efficient though. That’s gonna take forever to review. For the Europarl dataset, we’ve come up with a way to review a certain percentage and if that’s ok, we’d take the full dataset. This could then be added directly instead of going through the Sentence Collector. Maybe we could do something similar here? This would also make it easier to do for multiple languages and not just Catalan.
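The sample-then-accept idea above could be sketched roughly as follows. This is only an illustration: the helper names, the 1% sample fraction and the 5% error threshold are assumptions made for the example, not the values actually used for the Europarl import.

```python
import random


def sample_for_review(sentences, fraction=0.01, seed=0):
    """Return a reproducible random sample of sentences for manual review.

    `fraction` is a hypothetical value; pick whatever share reviewers can handle.
    """
    if not sentences:
        return []
    k = max(1, round(len(sentences) * fraction))
    # A fixed seed lets every reviewer regenerate the same sample.
    return random.Random(seed).sample(sentences, k)


def passes_review(num_reviewed, num_rejected, max_error_rate=0.05):
    """Accept the full dataset only if the sampled error rate is low enough.

    `max_error_rate` is an assumed threshold, not an official one.
    """
    return num_reviewed > 0 and num_rejected / num_reviewed <= max_error_rate
```

For 300,000 sentences a 1% sample is 3,000 sentences to review, and the whole set would only be imported if `passes_review` holds for that sample.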

jmontane · December 17, 2020, 12:05pm

Thanks, @mkohler. I was thinking of a direct import, as was done for the Wikipedia sentences. And yes, every language supported by CoVoST could import its sentences.

I parsed the CV 6.0 dataset, released yesterday. Catalan will soon run out of sentences. So, please, can anyone help us import the CoVoST2 sentences into Common Voice? Thanks in advance.

Changhan_Wang · January 8, 2021, 6:05am

Hi, I am the author of CoVoST and I support this proposal. We would like voices to be collected for the CoVoST translations as well, since that would enable a new application: speech-to-speech translation. This extends the scope of Common Voice to include human-human interaction without language barriers. Please let me know if you need any support with the CoVoST data.

jmontane · January 8, 2021, 10:42am

I made a PR with the Catalan CoVoST2 sentences.

I just parsed the Catalan CoVoST2 sentences to normalize and deduplicate them (there are many repeated sentences, and different apostrophe characters are used).

The sentences were translated by humans, and their quality is good enough.
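A minimal sketch of that kind of normalization and deduplication, for anyone doing the same for another language. The list of apostrophe variants and the function names are assumptions for illustration; this is not the exact script used for the PR.

```python
import unicodedata

# Assumed set of apostrophe look-alikes to fold into the plain ASCII apostrophe.
APOSTROPHE_VARIANTS = "\u2019\u2018\u02bc\u00b4`"
_APOSTROPHE_TABLE = str.maketrans({ch: "'" for ch in APOSTROPHE_VARIANTS})


def normalize(sentence):
    """Apply NFC, unify apostrophes and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", sentence)
    text = text.translate(_APOSTROPHE_TABLE)
    return " ".join(text.split())


def unique_sentences(sentences):
    """Return normalized sentences without duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for sentence in sentences:
        normalized = normalize(sentence)
        if normalized and normalized not in seen:
            seen.add(normalized)
            result.append(normalized)
    return result
```

Normalizing before deduplicating matters here: two copies of a sentence that differ only in the apostrophe character would otherwise both survive.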

Changhan_Wang · January 8, 2021, 4:12pm

Looks great! Yeah, we duplicate the translations accordingly for the same sentences (by different speakers) in validated.tsv.

moonhouse · April 23, 2021, 4:18pm

I see that CoVoST sentences were added for Swedish last week and are now shown in Common Voice. Were they tested in any way to check that they conform to the standard we have for manually added sentences? At first glance, the quality of these looks worse than that of the Europarl sentences I have prepared (but still haven’t gotten around to getting anyone else to quality-check for me).

Right now I get the impression that I wouldn’t finish a set of five sentences without happening upon at least one problematic sentence, either because of incorrect grammar or because of many words that would be pronounced in English.

I looked into this because I got the following sentence to listen to:

Vissa skärmdumpar av onlinedokument sk. thumbshots förbättras med informationsikoner eller korta textsammanfattningar.

It has the abbreviation “sk.”, which would be read out as “så kallade”, and the word “thumbshots”, which would be pronounced in English and not Swedish.

There are other sentences that contain even more words that would typically be read out with English pronunciation instead of Swedish, such as:

Hertigen är sponsor till The Northumberland Church of England Academy. (6/10 words in English)

Harmonic society choral (3/3 words in English, and not a real sentence either)

Mark Twain National Forest innehåller ytterligare allmän mark som bl.a. Bell Mountain Wilderness. (7/13 words, also note the abbreviation “bl.a.”)

Wyoming Women's Center, som är en del av Wyoming Department of Corrections, ligger i Lusk. (7/15 words)

“Källa: China Social Science Network" (4/5 words, also using quotes unnecessarily)

10556/263556 lines in the file contain quotes, which looks a bit high to my eyes. (In addition, 8844 lines contain “typographical” quotes that haven’t been normalized to the other type.)

Caesar sallad hör inte hemma på en pizza även om den är vegansk. (all words could be pronounced in Swedish but the first two words should be written together or at least hyphenated to be correct Swedish)

Begäran filtreras av umask. (umask is probably the Unix command here, and not something we can expect users to even begin to pronounce)

Cryer stödde ett flertal vänsterpolitiska ändamål och han var även EU-skeptisk. (EU-skeptisk here contains the acronym EU which I don’t think Sentence Collector would allow?)

Counting the most common Swedish abbreviations, there are:
37 with “bl.a.”
11 with “osv.”
15 with “etc.”
24 with “s.k.” or “sk.”
124 with “t.ex.”
10 with “pga.” or “p.g.a.”

So there are not too many sentences with abbreviations, but it is something a native speaker would have checked before adding them.
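Line counts like the ones above (quotes and abbreviations) can be reproduced with something along these lines. It is a sketch, not the exact commands behind the numbers in this post; the pattern names and the set of typographic quote characters are assumptions. One detail worth noting: a plain substring search for “sk.” also matches any sentence that ends in a word like “vegansk.”, so the patterns require that no letter comes directly before the abbreviation.

```python
import re


def abbreviation_pattern(*variants):
    """Match any variant, but only when it is not preceded by a word character."""
    alternatives = "|".join(re.escape(v) for v in variants)
    return re.compile(r"(?<!\w)(?:" + alternatives + ")", re.IGNORECASE)


# Hypothetical pattern table; extend it per language.
PATTERNS = {
    "bl.a.": abbreviation_pattern("bl.a."),
    "osv.": abbreviation_pattern("osv."),
    "etc.": abbreviation_pattern("etc."),
    "s.k.": abbreviation_pattern("s.k.", "sk."),
    "t.ex.": abbreviation_pattern("t.ex."),
    "p.g.a.": abbreviation_pattern("pga.", "p.g.a."),
    "straight quotes": re.compile('"'),
    "typographic quotes": re.compile("[\u201c\u201d\u201e]"),
}


def count_matching_lines(lines, patterns=PATTERNS):
    """Return, per pattern name, how many lines contain at least one match."""
    return {
        name: sum(1 for line in lines if pattern.search(line))
        for name, pattern in patterns.items()
    }
```

Running `count_matching_lines` over the lines of the sentence file gives one number per pattern, which makes it easy to compare languages or re-check after a cleanup.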

There also seems to be trouble with grammar.

Staden var ursprungligen en etrusk och osker bosättning.

This feels like an incorrect automatic translation of “The city was originally an Etruscan-Oscan settlement”; the two adjectives are wrong in the Swedish.

Ni har gjort själv inte bra av det.

I have trouble even understanding what this one is supposed to mean; it probably needs some words rearranged and others added to make any sense.

moonhouse · April 23, 2021, 4:48pm

There are also sentences in covost-en_sv-SE.txt that are purely in English, which makes me think there was some error in the preprocessing step. I found the following sentences by searching for the word “about”.

"I heard what you were talking about the other day with the alchemist," the wind said.
"Tell me more about your dream," said the woman.
And it knew nothing about love.
But he didn't need to worry about that right now.
He remembered what the old man had said about offering something you didn't even have yet.
He's nuts about you.
How about Egypt?
I will think about what to buy.
Many of them had been right about what they said, while some had been wrong.
Never mind about that.
Sometime during the second year, you'll remember about the treasure.
That's what I came to talk to you about.
The barrow and its surrounding ditch are well-preserved, about in diameter and high.
The infidels had an evil look about them.
They talk about learning from one's mistakes, but they won't admit to their own mistakes.
This building has about ten stories.
To understand recursion, one must first learn about recursion.
What I can say about the movie is that its effects are very great and awesome.
What is it all about?
What on earth are you rambling about?
What was it all about?

The commit on GitHub mentions trying language-detection tools but deciding not to use them:

I tried langid and fastText but after checking a few examples, I noticed that LID filtering would discard good translations in the correct language so I didn’t apply LID (since the translations are done by humans, occurrences of wrong language are rare).

In this case, all of those sentences would have been flagged as English by langid. Maybe it would be better to lose some sentences in the correct language if that lets us avoid most of the sentences in the wrong language?
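One possible middle ground, sketched below, is to set aside only the sentences a classifier labels as the unwanted language (English here) and send those to manual review instead of silently discarding them. The function names are hypothetical, and `toy_classify` is a deliberately crude stand-in based on a handful of English function words so that the example runs without extra packages; it is not langid or fastText, and a real run would plug in one of those as the `classify` argument.

```python
# Tiny, assumed list of English function words used by the stand-in classifier.
ENGLISH_HINTS = {"the", "and", "about", "what", "you", "was", "is", "it", "to", "of"}


def toy_classify(sentence):
    """Stand-in classifier: 'en' if at least 30% of the words are English hints."""
    words = [w.strip(".,!?\"'").lower() for w in sentence.split()]
    hits = sum(1 for w in words if w in ENGLISH_HINTS)
    return "en" if words and hits / len(words) >= 0.3 else "sv"


def split_by_language(sentences, classify, unwanted="en"):
    """Split sentences into (kept, flagged); flagged ones go to manual review."""
    kept, flagged = [], []
    for sentence in sentences:
        if classify(sentence) == unwanted:
            flagged.append(sentence)
        else:
            kept.append(sentence)
    return kept, flagged
```

Flagging only on a positive match for the unwanted language, rather than dropping everything not recognized as Swedish, is meant to limit the loss of good translations that the commit message was worried about.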