Discussion of new guidelines for recording validation

Common Voice is intended to provide greater representation for a diverse range of accents, so I don’t think native English speakers should solely dictate what is/isn’t acceptable pronunciation.

I tend to reject pronunciations that are extreme or sound too similar to a different word, and I skip if I’m not sure. Otherwise I try to be generous.

But I agree that the majority of the examples shown should be rejected.

@mbranson Thanks for the feedback. If it would be useful I could work up some similar guidelines to be linked from the Speak page. They can be largely based on the same examples, but the focus would need to be slightly different.

Makes sense @Michael_Maggs, they are indeed different interactions, but I’d be cautious about overwhelming people with the amount of information provided. How might we convey guidelines for both Speak and Listen in one document so they complement and inform one another? cc @nukeador

Thanks @mbranson. I’ve been thinking about how best to achieve that, and I’m not entirely sure how it would work. Some of the explanations will inevitably have to be different for speaking and reviewing, and putting both into the same document would make it quite unwieldy I’d have thought. I suppose one could have instructions in two columns, with common examples, but it won’t be very user-friendly. Or separate explanatory texts with links to a common set of examples, but that would require multiple click-throughs. Did you have something specific in mind?

How do you deal with situations where the person hesitates and elongates the word? DeepSpeech does need to be able to deal with drawn-out or over-emphasized letters after all.

I generally approve as long as the person doesn’t break the word.


“I wasn’t s…sure” = reject
“I wasn’t sssssure” = approve

I just wanted to see how others dealt with it. It’s helpful if we’re all operating by the same rules.

I do the same as you. Accept if it’s an elongation; reject if the reader takes two attempts to start the word.

This conversation is great, thanks everyone involved.

@Michael_Maggs would you be interested in maintaining an updated first post here with all the suggestions we have been hearing (maybe marking each as must-have vs nice-to-have)? Once we feel comfortable with the list, we can run an exercise with our great User Experience team to turn it into something more visual and quicker to take in on the site. :smiley:


Yes I’d be happy to do that. Will update in the next day or two.

Perhaps it would make sense for me to separate out into a new thread the guidelines for validation of uploaded sentences, as those will mostly be of use in the Sentence Collector.


I’m curious what fellow validators think about this: https://github.com/mozilla/voice-web/issues/1927

I’ll quote your message there for reference

Recently a change was made to the site to list sentences with the fewest recordings first in order to add more unique sentences for the DeepSpeech model. I think that this was a good idea overall, however I’m starting to see something that could be a problem.

Some users are recording a LOT of sentences. In fact, over the past few days I have validated around 1500-2000 clips and I would estimate around 70% of them were recorded by the same user, all of which were unique sentences.

I’m sure that the DeepSpeech team makes certain that there aren’t too many recordings by a single user, so these sentences will most likely be discarded until there are more recordings available. But if the site shows sentences with the fewest recordings first, it will have to go through the thousands of unrecorded sentences to get to that point again, which may never happen if more sentences keep getting added.

The DeepSpeech team said they don’t want more than a few hundred recordings from any one user. So a user with 5000 recordings may have prevented 4700 sentences from making it into the model.
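To make that arithmetic concrete, here is a minimal sketch. The cap of 300 clips per speaker is an assumption standing in for "a few hundred" (it is the value implied by the 5000 and 4700 figures above), and the function name is hypothetical rather than anything from the Common Voice or DeepSpeech code.

```python
def clips_lost_to_cap(recordings_by_user, per_user_cap=300):
    """Count the clips that exceed the per-user cap, summed over all users.

    per_user_cap=300 is an assumed value for "a few hundred".
    """
    return sum(max(0, count - per_user_cap) for count in recordings_by_user.values())


# One prolific user and one casual user (made-up numbers for illustration).
print(clips_lost_to_cap({"prolific_user": 5000, "casual_user": 40}))  # 4700
```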

So I think the solution to this is either:

  1. Put a hard limit on the total number of recordings users can make, or have a daily per-user limit.

  2. Change the algorithm so that each sentence has, say, 3 recordings minimum before it’s given a lower priority in the queue.

In the coming weeks we will be working on a few experiments involving personal goals and also invite more people to contribute since, as we have commented in the past, diversity is super important.

FYI, I edited the post since then to clarify a couple of things I thought were unclear, but probably most relevant is that I thought of an additional option:

  3. Deprioritize recordings in the validation queue from users who already have x validated recordings, so that we’re prioritizing unique users AND unique sentences.

In my opinion some combination of options 2 and 3 is best.
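As a rough illustration of how options 2 and 3 could work together, here is a minimal sketch of the two orderings. It is not the site's actual algorithm: the function names, the data shapes, and the threshold of 300 (standing in for "x") are all assumptions made for the example.

```python
MIN_RECORDINGS = 3         # option 2: "say, 3 recordings minimum"
USER_CLIP_THRESHOLD = 300  # option 3: the "x" in the post; 300 is an assumed value


def recording_queue(recordings_per_sentence):
    """Option 2: sentences below the minimum share top priority.

    Sentences that already have MIN_RECORDINGS or more follow, fewest first.
    `recordings_per_sentence` maps sentence text to its recording count.
    """
    def priority(sentence):
        count = recordings_per_sentence[sentence]
        return 0 if count < MIN_RECORDINGS else count

    # sorted() is stable, so sentences with equal priority keep their input order.
    return sorted(recordings_per_sentence, key=priority)


def validation_queue(clips, validated_per_speaker):
    """Option 3: clips from speakers at or above the threshold go to the back.

    `clips` is a list of (clip_id, speaker) pairs.
    """
    def is_heavy_contributor(clip):
        _, speaker = clip
        return validated_per_speaker.get(speaker, 0) >= USER_CLIP_THRESHOLD

    # False sorts before True, so clips from less-represented speakers come first.
    return sorted(clips, key=is_heavy_contributor)
```

With this ordering a sentence is not pushed behind every unrecorded sentence as soon as it gets its first recording, and a single prolific speaker cannot fill the front of the validation queue.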

Since I prepared the draft guidelines at the top of this thread, the number of sentences for recording verification that have errors in the written text has decreased considerably - probably because most have now gone through the Sentence Collector.

To avoid overloading the recording reviewer guidelines I’d now suggest removing the Problems with the written text section entirely. It can be moved over into a new guideline for sentence reviewers. At present all they have is a single paragraph on this page: https://common-voice.github.io/sentence-collector/#/how-to.

I’ve moved the sentence review guidelines over to this thread: Discussion of new guidelines for uploaded sentence validation

Hello. What should I do if the audio/voice is generated by robots(TTS)? :sweat_smile:

You should reject it. If you’re finding large numbers of instances, it would be worth posting a few examples here so that the developers can remove the whole batch by script, if need be.


I wonder whether registration should be mandatory, with a moderator, to avoid the silly submissions that create a lot of work. Thanks.

How strict should I be with possessives? An example sentence would be:

James’s car was stolen.

But the following is also valid in the English language:

James’ car was stolen.

Sometimes users pronounce that extra S when it’s not there and don’t pronounce it when it is. I am generally generous when it comes to mispronunciations if they are common (e.g. American mispronunciations of British names like “Warwickshire”) and I feel like this could be classed as a particular reader’s “style”. I have not formulated any particular policy on this and whether I approve or reject has so far depended on how generous I felt at that particular moment.

Just wanted to check what others think is appropriate or if it even matters.

I would argue that, to be valid, the sentence has to be read as written. If a reader changes what is written, the result may still be comprehensible of course, but to me that goes beyond what’s reasonable to consider merely the reader’s ‘style’. It’s the same as rejecting the reading “Weren’t” when the sentence actually says “Were not”.

Mispronunciation of proper names is a difficult one. Many errors will inevitably get through as we’re not using solely British speakers to verify British terms nor solely American speakers to verify American ones. I don’t know if it’s the best approach, but what I do is to reject any bad pronunciation that I know to be wrong, and allow through minor things such as slightly odd stress patterns. I’d reject “war-wick-shire” for Warwickshire as well as “ar-kan-sus” for Arkansas.

I’ve been thinking a bit more about local place names that have unusual pronunciations. Since the whole point of the project is to allow users to speak to their computer, we ought to make sure the computer’s pronunciation of a local or regional term is the same as that of the class of people who will frequently want to use it in dictation or whatever. For example, 99%+ of users who use the word ‘Warwickshire’ frequently and who will expect their computer to be able to understand it will be Brits, so we really ought to make sure their pronunciation is in the database. Having the guessed-but-incorrect “war-wick-shire” is worse than not having the word at all, as 99%+ of future users who need the word will have to teach their computer to unlearn it. The problem is particularly acute as only one reading of each sentence is being accepted for the corpus, so there will be many words where training is based only on a single reading by a single individual.

I wonder what people would think about modifying the guidelines at the top of this thread (Discussion of new guidelines for recording validation) to cover this?

As a new user, I was very glad to find these guidelines. I agree about the wall-of-words problem. It would be most helpful if they were available with the FAQs say - their absence brought me here.

As for Warwickshire, to me, that sounds like one for the “too hard” pile. Even among Brits, is there really but one pronunciation? And how would that be picked out? What about other cases: “St. John” as “sinjin”? “Worcestershire sauce” as it looks, as “woster sauce”, “wooster sauce”? I don’t know enough about the corpus, but if the same word appears in different sentences, wouldn’t the variants be there?

Re commas: Comma usage is highly personal - for examples, search “Oxford comma” or check Lynne Truss’s fun book about punctuation, “Eats, Shoots & Leaves” (the title refers to a joke about a panda). Usage may also vary with context - business, fiction, text to be read aloud, scripts, and so on. I dictate them explicitly when using voice-to-text (Dragon, MSSR for example) and can’t imagine it working well otherwise.

Finally, should recordings that leave a very long silence at the outset but that are otherwise correct be accepted?

Thanks for putting these together! They really helped answer several questions I had.

Yes, this is fine. The algorithm is designed to deal with this.