Discussion of new guidelines for recording validation

This is gold. Where were these when I was first starting out? :slight_smile:

I guess newbies might want to go through these before they even start to dip their toes in.

On the other hand, are there any procedures for evaluating beginner contributors? If someone ignores the guidelines systematically (e.g. unwittingly), maybe they could use some help.

1 Like

Hi everyone. Thanks for sharing the new guidelines for this great project. I have already started validating quite a number of voice samples, and I would like to know if I am allowed to record voice samples in languages that are not my native language.

Hi there, if you are a speaker of the language then yes, you definitely should record samples. My advice would be: don’t record anything whose meaning you don’t know, but if you can understand the sentence, then sure, the more voices the better.

Many thanks for your swift response.

1 Like

Hi everybody,
Given the complexity of the guidelines, I think only meticulous people should be allowed to review and accept sentences. I am thinking of my own language, where many people will be able to provide sentences, but when it comes to reviewing, you cannot guarantee that all accepted sentences/recordings are actually correct.
My question is, will there be some kind of admin who will do a final review to check quality?
There should even be multiple levels of control if we want to avoid a low-quality dataset.

Cheers
Ibrahima

The quality check is the validation by two other speakers, both in sentence collection and recording. If you have any other specific questions you could make a new topic and tell us a bit about your language and the kind of issues you are finding or expecting to find :slight_smile:

1 Like

Thanks to a translator from the community, the guidelines for recording validation are now also available in Esperanto:

Gvidlinioj por kontroli registraĵojn (“Guidelines for validating recordings”): https://parolrekonado.github.io/gvidlinioj/

If we are supposed to reject stuttered words, wouldn’t this lead the algorithm to have difficulty understanding people who stutter? Ideally it should understand them too, right?

1 Like

Another point for the guideline(s), which I discovered while validating in the English section lately: linguistic fillers (filler words).

For example
Text to read:
Today I recorded many sentences for “Common Voice”.

Recorded text:
Today I ehhhhh recorded many sentences for ehhhh “Common Voice”.

I think this falls under the “adding extra words to the sentence” or “trying to say the word twice” categories… At least, I treat them as such (and reject).

1 Like

@bozden I agree with you, this should be rejected.

To my knowledge, and in my personal opinion: an AI model has a finite memory (dictionary), which consists of the most frequent words and subwords, so the data should hold as much value as possible.
For example, “ehhhh” has no direct meaning that relates to the sentence, therefore it should be considered noise.
If the goal is to build a high-quality dataset, which should be the case, then high standards should be applied in filtering out such cases.

Noise is also important, but this should not be part of the actual dataset that we are trying to build.

Noise can later be added synthetically to have a more robust AI model.
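
To make that concrete, here is a minimal sketch of what “adding noise synthetically” can look like: mixing a noise clip into clean speech at a target signal-to-noise ratio with plain numpy. The function name `mix_at_snr` and its parameters are my own for illustration, not from any Common Voice or training tooling:

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise clip into a speech clip at a target SNR (in dB).

    Assumes both arrays are float waveforms at the same sample rate
    and that the noise clip is not silent.
    """
    # Loop the noise if it is shorter than the speech, then trim to length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)

    # Scale the noise so that 10 * log10(speech_power / scaled_noise_power)
    # equals the requested snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

At training time you would typically sample `snr_db` from a range (say 0 to 20 dB) so the model sees many noise levels; dedicated audio-augmentation libraries offer the same idea ready-made.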

1 Like

@daniel.abzakh, I’m not strict on this. Some texts may include such words, and then they must be spoken as they are written. Dialogue in novels and transcripts from oral history interviews do include hesitations like that, for example.

Here, however, the “ehhhh” was spoken out, but it doesn’t exist in the text.

I agree.

Same here, I also pressed “no” during validation.

I disagree; it is hard to add authentic noise. Adding static or mixing in sounds is fine, but if you do that, you miss the frequency effects of actual human experience.

About the “ehhhh” example I have no strong feelings; I don’t think it would particularly help or harm the model in low enough quantities.

Would you agree that removing noise is harder than adding noise?

Could you elaborate on this point?

This sounds to me like “if it can be avoided, then it is better”.

I have not tried to train STT models, but I can imagine they are sensitive to data quality, just as NMT models are.

Your input is appreciated!

Less so, because of the way the task works. Imagine this: “p” is acoustically closer to “b” than to “k”, but “pin” is not semantically closer to “bin” than to “kin”. It’s more like OCR than NMT.
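
As a concrete (and purely illustrative) sketch of that point: STT output is typically scored by edit distance over characters or words, a metric that is blind to semantics, so “bin” and “kin” are equally wrong transcriptions of “pin”. The `edit_distance` helper below is my own, not from any evaluation toolkit:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a single-row dynamic program."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            # Cheapest of: deletion, insertion, or (mis)match.
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

# Both hypotheses are one substitution away from the reference,
# even though /b/ sounds far closer to /p/ than /k/ does.
for hyp in ("bin", "kin"):
    print(hyp, edit_distance("pin", hyp))  # bin 1, kin 1
```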

I more frequently hear the wind and birds tweeting than the sounds of elephants roaring or large explosions.

For the purposes of training ASR systems, removing the noise is not necessary.

I disagree. If you want to make a LibriSpeech-style corpus, sure; but if you want to make a corpus that works when people are driving down the road in rural Chuvashia, you’d better have road and car noise in the dataset.

2 Likes

Thank you Francis.
I’ll keep those notes in mind.

Regular breaks I would also mention.

For validation, but also and especially for the recording process.
When concentration fades, the contributor hears or records everything else but the shown sentence.

We have two major problems with these guidelines:

  1. They do not include every possible scenario. They may even be language-specific, and there is no room to give further info (except in a sub-Discourse, if there is a community).
  2. But more importantly, nobody reads them!

In my opinion, these guidelines should not be voluntary reading; they must be presented as a contract. Even a compulsory test could be a good idea…

1 Like