If we reject stuttered words, wouldn’t this make it hard for the algorithm to understand people who stutter? Ideally it should understand them too, right?
Another point for the guideline(s), which I discovered while validating in the English section recently: linguistic fillers (filler words).
For example
Text to read:
Today I recorded many sentences for “Common Voice”.
Recorded text:
Today I ehhhh recorded many sentences for ehhhh “Common Voice”.
I think this falls under the “adding extra words to the sentence” or “trying to say the word twice” categories… At least, I treat them as such (and reject).
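As a toy illustration (my own sketch, not part of any official Common Voice tooling), aligning the prompt with a transcript of what was actually said makes a filler show up exactly as that kind of inserted extra word:

```python
import difflib

# Simplified word lists for the example above (punctuation dropped).
prompt = "Today I recorded many sentences for Common Voice".split()
spoken = "Today I ehhhh recorded many sentences for ehhhh Common Voice".split()

# Align the prompt against the spoken words; filler words appear as
# "insert" opcodes, i.e. extra words added to the sentence.
matcher = difflib.SequenceMatcher(a=prompt, b=spoken)
for op, i1, i2, j1, j2 in matcher.get_opcodes():
    if op == "insert":
        print("extra word(s):", spoken[j1:j2])
# Prints: extra word(s): ['ehhhh']  (once for each filler)
```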
@bozden I agree with you, this should be rejected.
To my knowledge, and in my personal opinion: an AI model has a finite vocabulary (dictionary) consisting of the most frequent words and subwords, so the data should carry as much value as possible.
For example, “ehhhh” has no direct meaning related to the sentence, so it should be considered noise.
If the goal is to build a high-quality dataset, which should be the case, then high standards should be applied when filtering out such cases.
Noise is also important, but it should not be part of the actual dataset we are trying to build.
Noise can later be added synthetically to get a more robust AI model.
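For what it’s worth, that kind of augmentation is straightforward. A minimal sketch, assuming both clips are already loaded as float NumPy arrays at the same sample rate (the function name and SNR framing are my own, not any particular toolkit’s API):

```python
import numpy as np

def mix_noise(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise recording into a clean clip at a target SNR (in dB)."""
    # Loop the noise if it is shorter than the clean clip, then trim to length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]

    # Scale the noise so that 10 * log10(P_clean / P_noise) equals snr_db.
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise
```

Mixing the same clean clip at several SNRs (say 20 dB for mild background noise down to 5 dB for a noisy street) gives the model many acoustic conditions from a single recording.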
@daniel.abzakh, I’m not strict on this. Some texts may include such words, and then they must be spoken as they are written. Conversations in novels and transcripts from oral history interviews, for example, do include hesitations like that.
Here, the “ehhhh” was spoken out, but it doesn’t exist in the text.
I agree.
Same here, I also pressed “no” during validation.
I disagree; it is hard to add authentic noise. Adding static or mixing in sounds is fine, but if you do that you miss the frequency effects of actual human experience.
About the ehhhh example I have no strong feelings; I don’t think it would particularly help or harm the model in low enough quantities.
Would you agree that removing noise is harder than adding noise?
Could you elaborate on this point?
This sounds to me like “if it can be avoided, then it is better”.
I have not tried to train STT models, but I imagine they are as sensitive to data quality as NMT models are.
Your input is appreciated!
Less so, because of the way the task works. Imagine this: “p” is closer to “b” than to “k”, but “pin” is not semantically closer to “bin” than to “kin”. It’s more like OCR than NMT.
I more frequently hear the wind and birds tweeting than sounds of elephants roaring or large explosions.
For the purposes of training ASR systems, removing the noise is not necessary.
I disagree. If you want to make a LibriSpeech-style corpus, sure; but if you want to make a corpus that works when people are driving down the road in rural Chuvashia, you’d better have road and car noise in the dataset.
Thank you Francis.
I’ll keep those notes in mind.
I would also mention regular breaks.
For validation, but also and especially for the recording process.
With a lack of concentration, the contributor hears or records everything else but the shown sentence.
We have two major problems with these guidelines:
- They do not include every possible scenario. They may even be language-specific, and there is no room to give further info (except in a sub-Discourse, if there is a community).
- But more importantly, nobody reads them!
In my opinion, these guidelines should not be voluntary reading but must be presented as a contract. Even a compulsory test could be a good idea…
> They do not include every possible scenario.

They do not have to - just the most common ones.

> They may even be language-specific, and there is no room to give further info (except in a sub-Discourse, if there is a community).

Change to language-specific rules and display them.

> But more importantly, nobody reads them!

I read them (hopefully you and some others did too), so this statement is incorrect.

> In my opinion, these guidelines should not be voluntary reading but must be presented as a contract.

Based on your scenario in point 2: who is reading the contract? One click on “accept”, and behavior changes very little from before.

> Even a compulsory test could be a good idea…

And if the contributor fails the test, that leads to… what?
Including the language-specific rules via the Sentence Collector for reading and recording.
(You also have some CC0 sentences for newly started languages/dialects, btw.)
I even read every post in Discourse, but I’m a nerd
You are right in some aspects, but I stand my ground…
We are preparing a large-scale campaign, and it will not be clear to the general public what to accept and what to reject. There are many nuances, and I worry about dataset quality. Therefore we are preparing a YouTube channel where we will publish video guides and short talks to convey the idea to non-document-readers…
Referring to my own post:
With low or no concentration, the previously mentioned misreadings (as described in the guidelines) can happen.
@bozden: no strings attached, and are contracts valid in every part of the world???
Hey everyone,
Thank you for your feedback that some of the recording and validation guidelines do not reflect all languages. The CV site currently doesn’t support language-specific content but we are working towards launching this feature in 2022. In the meantime, we would still like to support people to understand what the guidelines mean in their own context.
Ahead of the end-of-year dataset release, we are running a Community Validation Drive to support voice clip validation.
I would like to encourage community members who have language-specific needs that aren’t met by the existing criteria to start a new Discourse topic discussion on the Common Voice language threads (e.g. Spanish), or feel free to use another kind of platform, for example Stefano’s post on Esperanto criteria.
- Brainstorm what is missing or doesn’t work for your language with the current criteria, for example by creating a Discourse post or hosting a community call online.
- Draft a document with some examples of what you would like to include. We have created a template that you can use to support your discussion.
- Have a review period in which people can look over the doc and respond.
- When you’re happy with the draft, email commonvoice@mozilla.com with a Word-formatted document, or add it directly to this Google Drive.
If your language doesn’t have a Common Voice Discourse thread like Spanish and you would like one created, please direct-message me via Discourse.
Creating language-specific resources such as a validation guide can help with your community goals.
I’m happy to include links to these validation documents in the Community Playbook to help make them more visible. Please note the Community Playbook is linked on the website and community portal. This solution is just temporary, as next year we will be able to do this more easily on the platform, but hopefully it can help with the community’s needs.
If you have any questions, please let me know.
Thank you,
Hillary
We took another approach and “localized” the criteria page (zh / en page) to make it more suitable to our locale.
For example, in the second part of “Varying Pronunciations”, the first English example explains syllable position, but in the Chinese version we discuss heteronyms (one character with multiple pronunciations); the second English example talks about the number of syllables, while in the Chinese version we talk about differences in tone.
Remember that we’re doing “localization” instead of “translation”; we can always localize the criteria page to make it more relevant to the local context.
This is what we did too. On the other hand, as Pontoon has a 1:1 relationship between sentence translations, you cannot increase or decrease the number of bullet points/examples, which might be required by your language.
Should I record sentences if I have a speech defect (I can’t pronounce one sound)? Will it be useful for the dataset or not?