Reviewing recordings that are hard to understand, but likely correct?

Many (about half?) of the recordings fall into a strange grey area where:

a) If I listen to the recording while looking at the sentence, I think: they are saying exactly what’s written, but they have a different accent from mine. I will readily believe the pronunciation is correct for their accent.

b) However, if I have someone play the recording for me without showing me the sentence, I can’t fully transcribe it. (I found this a bit surprising. It’s easy to underestimate how much seeing the text primes me.)

I’m not sure how to rate these.

On the one hand, I am very reluctant to say “no”, because I can easily believe that a native speaker with the same accent would have no trouble understanding them. (The accent is usually Indian, though I could imagine the same thing happening with, say, a thick Scottish accent. I don’t know whether these speakers are native English speakers; see Indian English.)

On the other hand, I’m a little reluctant to say “yes”, because my experiment (b) has shaken my confidence that I can evaluate these. If I hadn’t done that little experiment, I’d probably be just clicking “yes” most of the time.

I think we need to sort out the dialects and accents strategy, so that contributors could identify their accent or dialect in their profile, and you could then limit your reviewing to those dialects and accents you are able to evaluate. This is a problem for other languages as well: I recall seeing a discussion here about Brazilian speakers possibly rejecting European Portuguese readings because of the dialect difference. In general I think better metadata on dialects and accents would benefit both data collection and applications.

I do like the idea of reviewing recordings by manual transcription. I don’t know if reviewers would object to the additional work. Maybe it could be an option as a secondary quality control, which would only cover the data as far as volunteers are available. I think it would also be a great way to collect transcriptions of spontaneous speech, which is another very useful type of speech data which is different from read speech.


Hey @falsifian and @cjbaker

Thanks for creating the topic.

The community has created guidelines for validating voice clips to help with reviewing. The guidelines account for accents and pronunciation differences. A version of them will soon be on the Common Voice platform.

On the second point, can you provide a bit more clarity, so I know how best to advise? In particular, the part where you say you “can’t fully transcribe” the recordings?

Regarding Language and Accent overall on Common Voice

We want to design a holistic approach to languages and accents that can work across communities. Following community feedback about the current challenges, this is a priority for the 2021/22 roadmap (see the post on the August open sessions to engage with this!). The team is starting to gather input and insights from research scientists, ML engineers, linguistic experts, and community members to map out new language workflows and accent capture mechanisms. These will be opened up to the community for discussion and user testing, so keep an eye out for those posts!

If you have any other questions, please feel free to ask!

Hi @heyhillary,

I did find those draft guidelines, but couldn’t figure out what they implied in this case.

As an example, I just listened to a recording with the text covered and tried to transcribe it. The best I could do was:

It [tills?] the return match between Barcelona and Real Madrid.

In fact it was:

It tells the return match between Barcelona and Real Madrid.

Probably their pronunciation of “tells” was correct for their dialect.

(I guess one problem here is that the sentence itself is strange, at least to me. I would not expect the word “tells” there.)


For another recording, the best I could do was:

They are kindly signed in drag city.

In fact it was:

They are currently signed to Drag City.

In hindsight, yes, that’s what they were saying, and I can’t identify any problem with the pronunciation.

I had to comb through many examples (15?) to find those, so it turns out “about half” is an exaggeration.

Hey @falsifian Thanks for sharing the examples.

For the first one, I think the rules regarding varied pronunciations would apply, since you imply that this person has an accent. So I would say yes, to allow for the margin of error.

For the second one, I would say reject it based on the rules (misreading section), as they are adding in words.

I hope this helps; if you have any questions, feel free to ask.

@heyhillary to be clear, for the second one, I don’t think they actually said the words “kindly” or “in”. In fact, if I’d seen the text to begin with, I bet I would have heard the correct version, and the other version wouldn’t have occurred to me.

Hi @falsifian

Thanks for the clarification.

It’s hard to give advice on this, as we encourage people to both read the text of the sentence and review the recording, to help with the quality of the dataset.

The validation guidelines are based on comparing the sentences and the voice clips. In your experience of the platform, do the sentences not appear on the screen, leaving you able to access only the recordings?

Hello @heyhillary - I think @falsifian is bringing up the “blind” transcription difficulty in order to demonstrate that it’s often very difficult to follow the validation guidelines (“reject even if there are minor errors”) for accents or dialects you’re not familiar with. If the verifier can’t identify the difference between “kindly” and “currently” in the speaker’s accent, then it is impossible to say whether there is a minor error.

When presented with the text during verification, it is easy to bias your verification towards confirming the text, rather than rejecting for minor errors. We often do this in conversation when listening to an unfamiliar accent, as we’re able to fill in the gaps of unexpected or incomprehensible pronunciations, but this seems less than optimal for ensuring the quality of the CommonVoice dataset.

I previously brought up two proposals that could perhaps be considered one day in order to help with this quality problem. One would be to let reviewers list which dialects and accents they’re familiar with, and only present them with sentences to review from these dialects and accents. Another would be to start collecting an additional type of data from reviewers, which would be “blind” manual transcriptions of sentences, as they heard them and unprompted by the text.

@falsifian I think it’s a good idea to practice listening to the audio before looking at the text, in order to reduce the bias of confirming the text. After practicing this skill for some time, you may find that you’re better able to decide when you can’t actually understand every word without looking, in which case I would reject or skip the sentence. It looks like you’re on the right path by taking statistics on your comprehension.


It tells the return match between Barcelona and Real Madrid.

This sentence seems ungrammatical to me, although that’s how it appears on Wikipedia. It was probably written by a Spanish speaker, translating something like “cuenta el partido…” (“it recounts the match…”). I would report those sentences.

In general I think that having a list of accents that can be evaluated by a single speaker makes sense. But also, if there are small issues it does not matter too much for the purposes of ASR. Remember that for ASR we are creating a mapping between audio frames and characters, so if they say “tills” and the text is “tells” it won’t be a big issue.
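To make the “tills” vs. “tells” point concrete: with character-level ASR targets, that slip is a single-symbol substitution in the label sequence. A toy Levenshtein-distance sketch in plain Python (illustrative only, not Common Voice tooling):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming over two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

print(edit_distance("tills", "tells"))  # → 1: one character substituted
```

So the training target differs from what was actually said by one character out of five, which is the sense in which a small pronunciation deviation is a small labeling issue.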

I also like the possibility of having information about what text users produce for a given audio recording. Something like this could be used for transcribing speech too.

Yes, @cjbaker captured what I mean.

At the risk of making this thread less focussed, I am curious why Mozilla wants a training set of perfectly-pronounced sentences.

Naïvely, if the goal is to make a training set for the task of translating speech to text, wouldn’t you want to capture all attempts to say things, even if they have mistakes? That way the training set would match the task.

For example, if I made a voice-activated assistant, and someone asked it “what tempchur now?”, I’d want it to say “it’s 23 degrees outside”, not “sorry, please use correct pronunciation and grammar”. In order to make the model flexible in this way, I think a realistic training set would be critical.

Of course, sometimes you’ll get something that’s just totally unintelligible. But that’s okay! Nobody expects to get 100% accuracy on ML tasks anyway.

In the quest for accurately labeled data, we definitely need to stop at some point for the current project’s scope. But the thing is, you can add whatever labels you want to a dataset: what they were attempting to say, what they actually said, etc. and make use of them for all different applications. I think the current guideline to reject even minor errors is good, and in line with quality control on many of the hundreds of other commercial and academic speech corpora which have been developed since the 70s (though most are smaller scale). My view is that if you would have transcribed a totally different word, or truly can’t understand a word, then it’s an error.

As you implied, there are potential applications beyond ASR for dictation, where a simple “reject” for minor errors might deprive the model of realistic examples of speech errors which it should learn to handle. Meanwhile, as word error rates on datasets like LibriSpeech are currently in the 2-3% range, that last 2% really may start to matter, and you might want to work with super-accurate transcriptions.
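For reference, word error rate is just word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. A quick sketch, using the earlier “kindly/currently” mishearing (standard formula, nothing specific to LibriSpeech’s scoring tools):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    r, h = ref.lower().split(), hyp.lower().split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, start=1):
        curr = [i]
        for j, hw in enumerate(h, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (rw != hw),  # substitution
            ))
        prev = curr
    return prev[-1] / len(r)

ref = "They are currently signed to Drag City"
hyp = "They are kindly signed in drag city"
print(round(wer(ref, hyp), 3))  # two substitutions over seven words ≈ 0.286
```

A single mis-verified clip like this contributes far more than the 2-3% error range quoted above, which is why accurate labels start to matter at that level.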

Thanks for the explanation, @cjbaker. Following what other successful corpora have done does indeed sound like a good idea.