Are these guidelines going to be published somewhere on the main page? Right now they’re very hard to find.
Hi @EwoutH, welcome to the Common Voice community!
This is definitely something we want to figure out how best to implement on the site in 2020. Right now the development focus is on infrastructure improvements.
We’ll definitely use this work to inform any changes in 2020.
Thanks!
I approve any recording that is understandable and matches the text, including incorrect pronunciations as long as they’re common ones.
It’s an inconvenient truth that any somewhat non-basic word will have multiple pronunciations around the world, but I don’t think keeping the “technically incorrect” ones out of the dataset is the solution; rather, it’s something that needs to be handled in a way that accounts for the variation.
@Michael_Maggs in case you want to update the post with links to localizations to other languages:
Spanish: 📕 Guía para validación de grabaciones en common voice (guide for validating recordings in Common Voice)
@Michael_Maggs thanks again for your consistent work on this (and to all of you for the input)! I see that there has been a break out for validating sentences (for sentence collector) and that most of the criteria listed here are for the act of validation (/listen). In the thread (a while back now, sheesh time flies) we started chatting about the pros/cons of breaking out criteria for /speak vs. /listen vs. one cohesive list. I now agree that a separate set of criteria for each is the best direction (there may be some overlap of course, such as criteria about background noise). I wonder if I’ve missed a post that is focused on suggested recording criteria (for /speak)? If not, is this something you’re still interested in creating?
Another new contributor here.
I think it is super important that these be shown to new users.
After reading these I’m realizing I’ve been way too lenient validating clips (giving a “thumbs up” to clips where I could just barely tell what a speaker was trying to say, in the interest of accent diversity).
The first two places I looked for some guidelines were the FAQs, then the About page.
A couple of other places that might work are:
- Account creation screen
- Bottom of each page, near FAQs, Terms, etc.
I think just an FAQ entry would help a lot of users; I kind of lucked out stumbling across this.
Other questions about validation & quality:
- Silent time before the speaker starts: is there some “hard” limit on this, e.g. “no less than 1 or 2 seconds”, or is it something alignment is able to fix (see the sketch below)? And what about silence at the end?
- Audible clicks: in most cases we can hear very noisy mouse clicks. Is this a problem?
- Hesitation: sometimes users hesitate in the middle of a sentence, either at the cost of a small vocal artifact or at the cost of a lengthy silence. Is that something alignment deals with, or should it be rejected?
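On the first question, leading/trailing silence at least looks fixable in preprocessing rather than at validation time. Here’s a minimal sketch with librosa (the file path and the 30 dB threshold are just placeholder assumptions on my part):

```python
# Sketch: strip quiet leading/trailing audio before training.
# "clip.mp3" and the 30 dB threshold are hypothetical placeholders.
import librosa

y, sr = librosa.load("clip.mp3", sr=None)            # load at native sample rate
trimmed, index = librosa.effects.trim(y, top_db=30)  # drop edges quieter than -30 dB
print(f"kept samples {index[0]}..{index[1]} of {len(y)}")
```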
Also a new contributor here, and I think that, due to the lack of instructions for validators (such as the ones in the OP), the dataset is probably very inconsistent. One plus is that the people who validate a large number of clips are probably more active in the community and have seen this thread.
Can I ask how these guidelines were decided? Who chose them? As someone primarily focused on applications for non-native speakers, I think marking intelligible but incorrect pronunciations as invalid may be a mistake. For instance, “rais-ed”, although incorrect, is intelligible to native speakers of English, and if you’re building a speech recognition system for English, I’d argue that you’d want the system to understand that when someone says “rais-ed”, they mean “raised”. That can only be achieved if examples like that are considered valid in the dataset.
Audible clicks and other noise should be considered valid, in my opinion. They don’t affect the labeled transcription, and any machine learning models trained on the dataset will learn to ignore the sounds. The same can be said for silence and hesitation. If the dataset doesn’t contain these artifacts, products built with it won’t be able to handle them when they occur, as they frequently do, in the real world.
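(And if the released dataset ever did end up too clean, the usual workaround is to mix noise in artificially at training time. A rough numpy/soundfile sketch of that kind of augmentation, with hypothetical file names and an assumed matching sample rate:)

```python
# Sketch: mix background noise into a clean clip at a target SNR.
# File names are hypothetical; both files are assumed mono at the same rate.
import numpy as np
import soundfile as sf

speech, sr = sf.read("clip.wav")         # clean validated clip
noise, _ = sf.read("mouse_clicks.wav")   # recording of the artifact
noise = np.resize(noise, speech.shape)   # loop/crop noise to the clip length

snr_db = 15.0                            # target signal-to-noise ratio
scale = np.sqrt(np.mean(speech**2) / (np.mean(noise**2) * 10**(snr_db / 10)))
sf.write("clip_noisy.wav", speech + scale * noise, sr)
```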
But during the learning phase, a clip is split up and aligned with words.
That’s where I wonder (@Tilman_Kamp ?) if this could negatively affect learning (implying a prejudicial effect on the final model).
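For context, my understanding is that DeepSpeech-style acoustic models use CTC loss, which sums over all possible frame-to-character alignments rather than relying on a fixed pre-segmentation. A toy PyTorch sketch with dummy tensors, purely to illustrate what that alignment step looks like:

```python
# Toy CTC step (PyTorch): the alignment between audio frames and characters
# is marginalized by the loss, not hand-segmented. All tensors are dummies.
import torch
import torch.nn as nn

T, N, C = 100, 1, 29  # frames, batch size, alphabet size (28 symbols + blank)
log_probs = torch.randn(T, N, C).log_softmax(2)            # stand-in model output
targets = torch.randint(1, C, (N, 10), dtype=torch.long)   # 10-character transcript
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 10, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())  # sums over every possible alignment of frames to characters
```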
This is gold. Where were these when I was first starting out?
I guess newbies might want to go through these before they even start to dip their toes.
On the other hand, are there any procedures for evaluating beginner contributors? If someone ignores the guidelines systematically (e.g. unwittingly), maybe they could use some help.
Hi everyone. Thanks for sharing the new guidelines for this great project. I have already started validating quite a number of voice samples, and I would like to know if I am allowed to record voice samples in languages that are not my native language.
Hi there, if you are a speaker of the language then yes, you definitely should record samples. My advice would be: don’t record anything you don’t understand, but if you can understand the sentence, then sure, the more voices the better.
Many thanks for your swift response.
Hi everybody,
Given the complexity of the guidelines, I think only meticulous people should be allowed to review and accept sentences. I am thinking of my own language, where many people will be able to provide sentences, but when it comes to reviewing, you cannot guarantee that all accepted sentences/recordings are actually correct.
My question is: will there be some kind of admin who will do a final review to check quality?
There should even be many levels of control if we want to avoid a low quality dataset.
Cheers
Ibrahima
The quality check is the validation by two other speakers, both in sentence collection and in recording. If you have any other specific questions, you could open a new topic and tell us a bit about your language and the kinds of issues you are finding or expecting to find.
Thanks to a translator from the community, the guidelines for recording validation are now also available in Esperanto:
Gvidlinioj por kontroli registraĵojn (Guidelines for validating recordings): https://parolrekonado.github.io/gvidlinioj/
If we reject stuttered words, wouldn’t that lead the algorithm to have difficulty understanding people who stutter? Ideally it should understand them too, right?
Another point for the guideline(s), which I discovered while validating in the English section lately: linguistic fillers (filler words).
For example:
Text to read:
Today I recorded many sentences for “Common Voice”.
Recorded text:
Today I ehhhhh recorded many sentences for ehhhh “Common Voice”.
I think this falls under the “adding extra words to the sentence” or “trying to say the word twice” categories… At least, I treat fillers as such (and reject).