Discussion of new guidelines for recording validation

You should reject it. If you’re finding large numbers of instances, it would be worth posing a few examples here so that the developers can remove the whole batch by script, if need be.


I wonder if it should be the mandatory registration and there would be a moderator to avoid funny things that give a lot of work thanks

How strict should I be with possessives? An example sentence would be:

James’s car was stolen.

But the following is also valid in the English language:

James’ car was stolen.

Sometimes users pronounce that extra S when it’s not there and don’t pronounce it when it is. I am generally generous when it comes to mispronunciations if they are common (e.g. American mispronunciations of British names like “Warwickshire”) and I feel like this could be classed as a particular reader’s “style”. I have not formulated any particular policy on this and whether I approve or reject has so far depended on how generous I felt at that particular moment.

Just wanted to check what others think is appropriate or if it even matters.

I would argue to be valid the sentence has to be read as written. If a reader changes what is written, that may be comprehensible of course, but to me that goes beyond what’s reasonable to consider merely the reader’s ‘style’. It’s the same as rejecting the reading “Weren’t” when the sentence actually says “Were not”.

Mispronunciations of proper names is a difficult one. Many errors will inevitably get through as we’re not using solely British speakers to verify British terms nor solely American speakers to verify American ones. I don’t know if it’s the best approach, but what I do is to reject any bad pronunciation that I know to be wrong, and allow through minor things such as slightly odd stress patterns. I’d reject “war-wick-shire” for Warwickshire as well as “ar-kan-sus” for Arkansas.

I’ve been thinking a bit more about local place names that have unusual pronunciations. Since the whole point of the project is to allow users to speak to their computer, we ought to make sure the computer’s pronunciation of a local or regional term is the same as that of the class of people who will frequently want to use it in dictation or whatever. For example, 99%+ of users who use the word ‘Warwickshire‘ frequently and who will expect their computer to be able to understand it will be Brits, so we really ought to make sure their pronunciation is in the database. Having the guessed-but-incorrect “war-wick-shire” is worse than not having the word at all, as 99%+ of future users who need the word will have to teach their computer to unlearn it. The problem is particularly acute as only one reading of each sentence is being accepted for the corpus, so there will be many words where training is based only on a single reading by a single individual.

I wonder what people would think about modifying the guidelines at the top of this thread (Discussion of new guidelines for recording validation) to cover this?

As a new user, I was very glad to find these guidelines. I agree about the wall-of-words problem. It would be most helpful if they were available with the FAQs say - their absence brought me here.

As for Warwickshire, to me, that sounds like one for the “too hard” pile. Even among Brits, is there really but one pronunciation? And how would that be picked out? What about other cases: “St. John” as “sinjin”? “Worcestershire sauce” as it looks, as “woster sauce”, “wooster sauce”? I don’t know enough about the corpus, but if the same word appears in different sentences, wouldn’t the variants be there?

Re commas: Comma usage is highly personal - for examples, search “Oxford comma” or check Lynn Truss’s fun book about punctuation, “Eats, Shoots and Leaves” (the title refers to koalas) . Usage may also vary with context - business, fiction, text to be read aloud, scripts, and so on. I dictate them explicitly when using voice-to-text (Dragon, MSSR for example) and can’t imagine it working well otherwise.

Finally, should recordings that leave a very long silence at the outset but that are otherwise correct be accepted?

Thanks for putting these together! They really helped answer several questions I had.

Yes, this is fine. The algorithm is designed to deal with this.

Hi, I’m happy to finally find the guidelines. I believe they should be added somewhere on the main site. Phrases that I found there, “Users validate the accuracy of donated clips, checking that the speaker read the sentence correctly” and “did they accurately speak the sentence?” are just too vague.
I understand that the extensive guidelines are hidden here to avoid scaring new users, but I think these should be available on the main site for users who prefers to be precise in their decisions in validating/rejecting clips.

Misspelling, “out” is written twice in the guideline.

Well spotted! I’ve made the correction.

Are these guidelines going to be published somewhere one the main page? Right now they’re very hard to find.

1 Like

Hi @EwoutH, welcome to the Common Voice community!

This is definitely something we want to see how to best implement on the site in 2020. Right now the development focus is on the infrastructure improvements.

We’ll definitely use this work to inform any changes in 2020.


I approve any recording that is understandable and match the text, including incorrect pronunciations as long as it’s a common one.

It’s an inconvenient truth that any somewhat non-basic word will have multiple pronunciations over the world, but I don’t think keeping the “technically incorrect” ones out of the dataset is the solution, it’s rather something that needs to be handled in a way that accounts for it.

@Michael_Maggs in case you want to update the post with links to localizations to other languages:

Spanish 📕 Guía para validación de grabaciones en common voice

@Michael_Maggs thanks again for your consistent work on this (and to all of you for the input)! I see that there has been a break out for validating sentences (for sentence collector) and that most of the criteria listed here are for the act of validation (/listen). In the thread (a while back now, sheesh time flies) we started chatting about the pros/cons of breaking out criteria for /speak vs. /listen vs. one cohesive list. I now agree that a separate set of criteria for each is the best direction (there may be some overlap of course, such as criteria about background noise). I wonder if I’ve missed a post that is focused on suggested recording criteria (for /speak)? If not, is this something you’re still interested in creating?

Another new contributor here.

I think it is super important that these be shown to new users.

After reading these I’m realizing I’ve been way to lenient validating clips (giving a “thumbs up” to clips where I could just barely tell what a speaker was trying to say, in the interest of accent diversity).

The first two places I looked for some guidelines were the FAQs, then the About page.

Another couple places that may work are:

  • Account creation screen
  • Bottom of each page, near FAQs, Terms, etc.

I think just an FAQ entry would help a lot of users; I kind of lucked out stumbling across this.


Other questions about validation & quality:

  • Silent time before the speaker starts. Is there some “hard” limit about this, eg “no less than 1 or 2 seconds” or is something alignment is able to fix? And about the end?

  • Audible clicks : In most cases we can here very noisy mouse clicks. Is this a problem?

  • Hesitation: Sometimes, users hesitate in the middle of a sentence. Either at the cost of a small vocal artifact either at the cost of a lengthy silence. Is it something alignment deals with or that should be rejected?

Also a new contributor here and think that due to lack of instruction for validators (such as the ones in OP) the dataset is probably very inconsistent. One plus is that the people who validate a large number of clips are probably more active in the community and have seen this thread.

Can I ask how these guidelines were decided? Who chose them? As someone primarily focused on applications for non-native speakers I think marking any intelligible but incorrect pronunciations as invalid may be a mistake. For instance “rais-ed” although incorrect, is intelligible to native speakers of English and if you’re building a speech recognition system for English, I’d argue that you’d want the system to understand that when someone says “rais-ed”, they mean “raised”. That can only be achieved if examples like that are considered valid in the dataset.

Audible clicks and other noise should be considered valid in my opinion. They don’t affect the labeled transcription and any machine learning models trained on the dataset will learn to ignore the sounds. The same can be said for silence and hesitation. If the dataset doesn’t contain these artifacts, when people use products built using the dataset the products won’t be able to handle those artifacts when they frequently occur in the real world.

But during the learning phase, a clip is split and aligned with words.
That’s were I wonder (@Tilman_Kamp ?) if this could affects negatively learning (implying a prejudicial effect on the final model).