Discussion of new guidelines for recording validation

Generally this is a good set of guidelines. Nice work @Michael_Maggs!

However, I note that we should clearly separate the guidelines for validating text from those for validating speech; the “Problems with the written text” section is more geared towards text.

One note on the “Ignore minor problems of punctuation if they don’t affect the recording” part…

The example “the giant dinosaurs of the Triassic,” is given as one in which the punctuation does not affect the reading. That’s actually not quite the case.

Think of sentences which include commas. Generally when a comma is used correctly, it indicates a pause. A speech-to-text engine trained on text which uses commas correctly and which is read with the associated pause would learn to insert commas at the appropriate pauses.

However, if sentences similar to the above “the giant dinosaurs of the Triassic,” were used to train the system, it would never learn to insert commas in the correct place as there would be no correlation between commas and pauses. Sometimes commas would occur at the ends of sentences, sometimes at the start, sometimes randomly within sentences.

So it’s better to reject such sentences, as they will cause the engine to have an invalid understanding of commas.


That’s good to know; we’ll make sure the sentence collector cleans up these orphan commas automatically.
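For concreteness, such a cleanup pass might look something like the sketch below. This is only an illustration of the idea, assuming the collector works on isolated sentence fragments; `clean_orphan_commas` is a hypothetical name, not the sentence collector’s actual code.

```python
import re

def clean_orphan_commas(sentence: str) -> str:
    """Hypothetical cleanup: strip commas stranded at the start or end of
    a sentence fragment, where they carry no pause information. Commas
    inside the sentence are kept, since those do correspond to pauses."""
    s = sentence.strip()
    s = re.sub(r"^[,\s]+", "", s)   # leading orphan commas
    s = re.sub(r"[,\s]+$", "", s)   # trailing orphan commas
    return s
```

For example, `clean_orphan_commas("the giant dinosaurs of the Triassic,")` would drop the trailing comma while leaving a sentence like “Well, are you coming” untouched internally.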


So we should be refusing voice samples that don’t pause for commas? That’s a lot of them.

There should be an information video for both submitting and validating that tells people this.


So it’s better to reject such sentences, as they will cause the engine to have an invalid understanding of commas.

I don’t think it is common for speech recognizers to actually listen to speech timing (prosody) to determine punctuation; it is more commonly done as a post-processing step based solely on the text generated by the recognizer. If I’m not mistaken, the CTC used in Mozilla DeepSpeech excludes punctuation characters, and outputs all lower-case, unpunctuated text. Still, I think correctly read punctuation is better than incorrect, and maybe some future system will put it to use. I expect this project’s data to be used much more widely than just Mozilla DeepSpeech.
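To illustrate the point: a consumer targeting a CTC model with an unpunctuated, lower-case alphabet can always derive that form from richer transcripts. This is a rough sketch of that kind of normalization, assuming an English character set of letters, apostrophes, and spaces; it is not DeepSpeech’s actual preprocessing code.

```python
import re

def normalize_transcript(text: str) -> str:
    """Reduce a fully punctuated transcript to the kind of lower-case,
    unpunctuated text a CTC acoustic model with a restricted alphabet
    typically trains on."""
    text = text.lower()
    # Replace anything outside the assumed alphabet with a space.
    text = re.sub(r"[^a-z' ]+", " ", text)
    # Collapse runs of whitespace left behind by removed punctuation.
    return re.sub(r"\s+", " ", text).strip()
```

This also supports the wider argument below: if the dataset keeps full punctuation, downstream users can cheaply throw it away, but the reverse is impossible.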

I wonder if we could add a set of checkboxes in the validation interface for some extra annotations: misread punctuation, incorrectly pronounced words, audio problems, etc. Maybe the current “thumbs up / down” could be the default, but a larger annotation interface could be a per-user option? A range of annotations, not just “good/bad”, can be useful for many purposes.
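As a rough illustration of the data such an interface might record, here is one possible shape for a per-clip validation record. All names here are hypothetical, not part of any current Common Voice schema.

```python
from dataclasses import dataclass

@dataclass
class ClipValidation:
    """One validator's annotations for one clip (illustrative only)."""
    clip_id: str
    approved: bool                       # today's thumbs up / down
    # Optional extra annotations, defaulting to "no problem noted":
    misread_punctuation: bool = False
    mispronounced_words: bool = False
    audio_problems: bool = False
```

The default thumbs up/down flow would only ever set `approved`, while users who opt into the larger annotation interface could also tick the extra flags.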


I agree, but we have to think of the future, not the present, when creating the data set.

In addition, keeping one limited application in mind (Deep Speech) is the wrong way to think about the data set. The data set should retain as much fidelity as possible and users of the data set can choose to not use this fidelity, e.g. throw away punctuation, if they decide to do so. Creating the data set with this in mind will give it maximal utility.


Hi all, what about question marks and the like at the end of sentences? I often hear speakers not respecting them (for instance, for the text “you go first?” they would not raise the tone at the end of the recording, thus making it sound like “you go first!”). Should I still validate the recording, or reject it?

@Bloubi I think you should accept that. There’s a lot of variety between different nationalities, and a rising tone does not always go with a question mark. Also, sentences like “Well, are you coming?” said in a cross voice don’t have a rising tone.

I’ve edited the draft above to make it clear that we don’t want computer-synthesized voices. See What if people are using text-to-speech to record?


I’ve updated the draft at the top of this thread to include the various comments made so far. More feedback from the specialists would be welcome though. Pinging @nukeador, @kdavis, @josh_meyer, @mbranson, @gregor, @mkohler.

I agree that we don’t want to scare off new contributors by presenting the guidelines up-front as an off-putting wall of text that they have to read. A light-touch way to have them available would be to add to the recording validation page a short line such as:

Unsure? Click the Skip button, or read the guidelines here.


Thanks for all the great work here @Michael_Maggs. :slight_smile: In terms of UX I agree with the gist of what you’ve said above. In addition to a call to action link from the Listen page of the contribution portal, we’d want to highlight guidelines on the Speak page too. Making contribution an informed process on both sides of a clip may be most beneficial.

This is content we would also want to make available via the FAQ and through our upcoming About page implementation (where we outline the overall recording and validation process).

In terms of longer-term UX goals, we’d ideally create an ‘onboarding flow’ where concepts like this are presented earlier to lower the barrier for new visitors to get started. This is precisely where something like a video could be very effective indeed. :slight_smile: cc @ajay.dixon

@nukeador it’d be most helpful to work with @r_LsdZVv67VKuK6fuHZ_tFpg to get an initial implementation of this guideline work into our sprints when ‘finalized’; e.g. a guideline page w/link from the contribution portal and FAQ.


Something to add to the list would be ambiguous sentences.

An example from the current corpus would be:

I only read the quotations.

The sentence has no errors, but it’s still unclear what the correct pronunciation is (“read” could be present or past tense).


@dabinat I would think that is an issue that the algorithm simply has to cope with. It’s an annoying feature of the English language that can’t be avoided.

@mbranson Thanks for the feedback. If it would be useful I could work up some similar guidelines to be linked from Speak page. They can be largely based on the same examples, but the focus would need to be slightly different.

I’m wondering if we should tighten up the section of the guideline covering mispronunciations. When validating, I very frequently come across mispronunciations by non-native English speakers. Examples over the last half-hour include

‘bass drop’ (with a short ‘a’, as in the fish) [Bass drop]
‘Hewgee’ [Hughie]
‘jinny pigs’ [guinea pigs]
‘chemical’ (with the ‘ch’ pronounced as in chalk) [chemical]
‘contain-ed’ (three syllables) [contained]
‘It bet me’ [It beat me]
‘laat’ [laughed]
‘calorim-Ater’ [calorimeter]
‘knitting’ (with the k and n both pronounced) [knitting]

While I can’t be absolutely sure that these aren’t valid pronunciations that I’ve simply never heard before, it seems more likely that the reader is simply guessing.

How should these be handled? What do others do?


I tend to click ‘no’ and move on for extremely mispronounced words. I’m of the opinion that soon enough, another speaker from the same country will submit a correct recording.

I would click no for all of your examples, though ‘bass drop’ and ‘calorimeter’ might be OK.

Any ambiguous pronunciation where I’m not sure, I click ‘skip’. That doesn’t happen very often though.

Common Voice is intended to provide greater representation for a diverse range of accents, so I don’t think native English speakers should solely dictate what is/isn’t acceptable pronunciation.

I tend to reject pronunciations that are extreme or sound too similar to a different word and I skip if I’m not sure, otherwise I try to be generous.

But I agree that the majority of the examples shown should be rejected.


@mbranson Thanks for the feedback. If it would be useful I could work up some similar guidelines to be linked from Speak page. They can be largely based on the same examples, but the focus would need to be slightly different.

Makes sense @Michael_Maggs, they are indeed different interactions, but I’d be cautious not to overwhelm people with the amount of information provided. How might we convey guidelines for both Speak and Listen in one document so they complement and inform one another? cc @nukeador

Thanks @mbranson. I’ve been thinking about how best to achieve that, and I’m not entirely sure how it would work. Some of the explanations will inevitably have to be different for speaking and reviewing, and putting both into the same document would make it quite unwieldy I’d have thought. I suppose one could have instructions in two columns, with common examples, but it won’t be very user-friendly. Or separate explanatory texts with links to a common set of examples, but that would require multiple click-throughs. Did you have something specific in mind?

How do you deal with situations where the person hesitates and elongates the word? DeepSpeech does need to be able to deal with drawn-out or over-emphasized letters after all.

I generally approve as long as the person doesn’t break the word.


“I wasn’t s…sure” = reject
“I wasn’t sssssure” = approve

I just wanted to see how others dealt with it. It’s helpful if we’re all operating by the same rules.

I do the same as you. Accept if it’s an elongation; reject if the reader takes two attempts to start the word.