Discussion of new guidelines for recording validation

I think this falls under the “adding extra words to the sentence” or “trying to say the word twice” categories… At least, I treat them as such (and reject).

1 Like

@bozden I agree with you, this should be rejected.

To my knowledge, and in my personal opinion: an AI model has a finite memory (a dictionary), which consists of the most frequent words and subwords, so the data should carry as much value as possible.
For example, “ehhhh” has no direct meaning relating to the sentence, so it should be considered noise.
If the goal is to build a high-quality dataset, which should be the case, then high standards should be applied when filtering out such cases.
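To make the finite-dictionary point concrete, here is a toy sketch; the vocabulary and the WordPiece-style greedy splitter below are hypothetical, purely to illustrate how an out-of-vocabulary filler like “ehhhh” decomposes into low-information pieces:

```python
# Toy illustration: a model only "knows" the subwords in its vocabulary,
# so an out-of-vocabulary filler like "ehhhh" is shredded into pieces
# (or <unk>). The vocabulary here is hypothetical, for demonstration only.

TOY_VOCAB = {"the", "cat", "sat", "eh", "##h"}  # hypothetical subword vocab

def greedy_subword_tokenize(word: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match-first subword split (WordPiece-style)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:
            return ["<unk>"]  # no known piece fits: the whole word is unknown
    return pieces

print(greedy_subword_tokenize("cat", TOY_VOCAB))    # ['cat'] - one meaningful unit
print(greedy_subword_tokenize("ehhhh", TOY_VOCAB))  # ['eh', '##h', '##h', '##h'] - noise pieces
```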

Noise is also important, but it should not be part of the actual dataset that we are trying to build.

Noise can later be added synthetically to produce a more robust AI model.
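As a rough illustration of what synthetic augmentation could look like (a minimal sketch, assuming a mono float32 numpy waveform; the 440 Hz tone stands in for a real clip), white Gaussian noise can be mixed in at a chosen signal-to-noise ratio:

```python
import numpy as np

def add_white_noise(clean: np.ndarray, snr_db: float, rng=None) -> np.ndarray:
    """Add white Gaussian noise at the given signal-to-noise ratio (in dB)."""
    rng = rng or np.random.default_rng()
    signal_power = np.mean(clean ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))  # SNR = 10*log10(Ps/Pn)
    noise = rng.normal(0.0, np.sqrt(noise_power), size=clean.shape)
    return (clean + noise).astype(clean.dtype)

# 1 s of a 440 Hz tone as a stand-in for a real recording.
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000).astype(np.float32)
noisy = add_white_noise(clean, snr_db=10.0)
```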

1 Like

@daniel.abzakh, I’m not strict on this. Some texts may include such words, and then they must be spoken as they are written. Dialogue in novels and transcripts from oral history interviews, for example, do include such hesitations.

Here, the “ehhhh” was spoken, but it doesn’t exist in the text.

I agree.

Same here, I also pressed “no” during validation.

I disagree; it is hard to add authentic noise. Adding static or mixing in sounds is fine, but if you do that, you miss the frequency effects of actual human experience.

About the “ehhhh” example I have no strong feelings; I don’t think it would particularly help or harm the model in small enough quantities.

Would you agree that removing noise is harder than adding noise?

Could you elaborate on this point?

This sounds to me like “if it can be avoided, then it will be better”.

I have not tried to train STT models, but I imagine they are as sensitive to data quality as NMT models.

Your input is appreciated!

Less so, because of the way the task works. Imagine this: “p” is closer to “b” than to “k”, but “pin” is not semantically closer to “bin” than to “kin”. It’s more like OCR than NMT.
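One way to see this point: the surface metrics ASR is trained and evaluated against charge every substitution the same, regardless of acoustic or semantic closeness. A small sketch, using plain Levenshtein distance as a stand-in for such metrics:

```python
# Character-level edit distance treats "pin"->"bin" and "pin"->"kin"
# identically, even though /p/ and /b/ are acoustically much closer
# than /p/ and /k/. Semantics never enters the calculation.

def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(edit_distance("pin", "bin"))  # 1
print(edit_distance("pin", "kin"))  # 1 - same surface cost despite different acoustic similarity
```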

I more frequently hear the wind and birds tweeting than sounds of elephants roaring or large explosions.

For the purposes of training ASR systems, removing the noise is not necessary.

I disagree. If you want to make a LibriSpeech-style corpus, sure; but if you want to make a corpus that works when people are driving down the road in rural Chuvashia, you’d better have road and car noise in the dataset.
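A minimal sketch of what that could look like in practice, assuming mono numpy arrays at the same sample rate (the file names and the use of soundfile are hypothetical): a recorded environment clip is tiled to length and scaled so the mixture hits a target SNR.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    # Tile or trim the recorded noise to match the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    # Scale the noise so the mixture reaches the requested SNR.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Hypothetical usage (file names are placeholders):
#   import soundfile as sf
#   speech, sr = sf.read("clip.wav")
#   road_noise, _ = sf.read("road_noise.wav")
#   noisy = mix_at_snr(speech, road_noise, snr_db=5.0)
```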

2 Likes

Thank you Francis.
I’ll keep those notes in mind.

I would also mention regular breaks.

For validation, but also and especially for the recording process.
When concentration lapses, the contributor hears or records everything else, but not the shown sentence.

We have two major problems with these guidelines:

  1. They do not include every possible scenario. They may even be language-specific, and there is no room to give further info (except in a sub-Discourse, if the community has one).
  2. But more importantly, nobody reads them!

In my opinion, these guidelines should not be voluntary reading, but must be presented as a contract. Even a compulsory test could be a good idea…

1 Like
  1. They do not include every possible scenario.

  • They do not have to; just the most common ones.

They may even be language-specific, and there is no room to give further info (except in a sub-Discourse, if the community has one).

  • Change to language-specific rules and display them.

  2. But more importantly, nobody reads them!

  • I read them (hopefully you and some others did too), so this statement is incorrect. :grin:

In my opinion, these guidelines should not be voluntary reading, but must be presented as a contract.

  • Based on your scenario in point 2: who reads the contract? One click on “accept”, and behavior changes very little from before.

Even a compulsory test could be a good idea…

  • And if the contributor fails the test, that leads to…?

Including the language-specific rules via the Sentence Collector, for reading and recording.
(You also have some CC0 sentences for newly started languages/dialects, btw.)

1 Like

I even read every post in Discourse, but I’m a nerd :rofl:

You are right in some aspects, but I stand my ground…

We are preparing a large-scale campaign, and it will not be clear to the general public what to accept and what to reject. There are many nuances, and I worry about the dataset quality. Therefore we are preparing a YouTube channel where we will publish video guides and short talks to get the idea across to non-document-readers…

1 Like

Referring to my own post:

With low or no concentration, the previously mentioned misreadings (as described in the guidelines) can happen.

@bozden: no strings attached; and are contracts valid in every part of the world?

1 Like

Hey everyone,

Thank you for your feedback that some of the recording and validation guidelines do not reflect all languages. The CV site currently doesn’t support language-specific content, but we are working towards launching this feature in 2022. In the meantime, we would still like to support people in understanding what the guidelines mean in their own context.

Ahead of the end-of-year dataset release, we are running a Community Validation Drive to support voice clip validation.

I would like to encourage community members who have language-specific needs that aren’t met by the existing criteria to start a new Discourse topic in the Common Voice language threads (e.g. Spanish), or feel free to use another kind of platform, for example Stefano’s post on Esperanto criteria.

  1. Brainstorm what is missing or doesn’t work for your language with the current criteria, for example by creating a Discourse post or hosting an online community call

  2. Draft a document that has some examples of what you would like to include. We have created a template that you can use to support your discussion.

  3. Have a review period in which people can look over the doc and respond

  4. When you’re happy with the draft, email commonvoice@mozilla.com with a Word-formatted document, or add it directly to this Google Drive

If your language doesn’t have a Common Voice Discourse thread like Spanish does and you would like one created, please direct message me via Discourse.

Creating language-specific resources such as a validation guide can help with your community goals.

I’m happy to include links to these validation documents in the Community Playbook to help make them more visible. Please note the Community Playbook is linked on the website and the community portal. This solution is just temporary, as next year we will be able to do this more easily on the platform, but hopefully it can help with the community’s needs.

If you have any questions, please let me know.

Thank you,

Hillary

1 Like

We took another approach: we “localized” the criteria page (zh / en page) to make it more suitable for our locale.

For example, in the second part on “Varying Pronunciations”, the first English example explains the position of a syllable, but in the Chinese version we discuss heteronyms (one character with multiple pronunciations); the second English example talks about the number of syllables, while in the Chinese version we talk about differences in tones.

Remember that we’re doing “localization” instead of “translation”; we can always localize the criteria page to be more relevant to the local context.

1 Like

This is what we did. On the other hand, as Pontoon has a 1-to-1 relationship in sentence translations, you cannot increase or decrease the number of bullet points/examples, which your language might require.

2 Likes

Should I record sentences if I have a speech defect (I can’t pronounce one sound)? Will it be useful for the dataset or not?

Hey Irvin, thanks for highlighting this approach. As Bulent has mentioned, unfortunately Pontoon has a 1-to-1 relationship in sentence translation and doesn’t give much flexibility for languages that need more examples or bullet points.

Hey Nazar,

Currently, the validation criteria provide room for varying pronunciations; however, speech pathologies are not explicitly called out.

There is a risk that your clips could be incorrectly invalidated. Still, your contributions are valuable, as we want to ensure that ASR models can understand everyone.

Internally, the Common Voice team has been looking into how we can be more inclusive of people with varying speech pathologies. If you are interested in engaging with this, please direct message me.

2 Likes