Running out of sentences to validate

Hi, while I’m validating I’m getting these messages. Apparently I’m not able to record, and now I can’t validate any more clips either, so I think there’s a bug on the site. I should be able to re-record sentences if needed; speech data is very sparse, and reading the same sentences again wouldn’t be an issue for me. I think DeepSpeech would benefit too, since we speak a little differently every time we speak.

Yep, that’s a bug. English has a massive backlog so you shouldn’t be running out of clips to validate.

This was happening the other day and Gregoor said there’d been some downtime with some of the servers, so it’s probably that again. If you keep trying it should eventually work. Either refresh the page or go back to the homepage and then go back to validation.

Hi. When recording in German, I don’t get anything displayed to record, and no message either, just a plain screen. Recording in English works, and so does validating. I don’t think I’ve recorded all possible sentences yet, maybe 1200 so far. A bug?

I’m not sure about this, but one key element is to get a huge, diverse set of voices. Once you have recorded at least 250-300 clips, it’s probably a good idea to invest more time in reviewing, and especially in finding more people to donate their voices (gathering some friends, running a workshop at the uni…)

Cheers.

What Ruben said is on point. Some other notes:

  • Sentences as well as users are segmented into different buckets (train, dev, test) for ML purposes. The smallest bucket in German might be just 1.2k sentences.
  • We definitely need to add UI for when the recording limit has been reached; it’s tracked on GitHub.
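
To illustrate the bucket point, here is a minimal sketch of how a deterministic train/dev/test split could work. It is not the actual Common Voice implementation: the 80/10/10 ratios, the hashing scheme, and the names `assign_bucket` and `sentences_for_user` are all assumptions made up for this example. It also assumes that a user is only offered sentences from their own bucket, which would explain hitting a limit well before all sentences are recorded.

```python
import hashlib

# Hypothetical split ratios; the real proportions are not stated in this thread.
BUCKETS = (("train", 0.8), ("dev", 0.1), ("test", 0.1))


def assign_bucket(key: str) -> str:
    """Map a sentence or user ID to a bucket, always the same one for the same key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the hash as a fraction in [0, 1).
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for name, share in BUCKETS:
        cumulative += share
        if fraction < cumulative:
            return name
    return BUCKETS[-1][0]  # guard against floating-point rounding


def sentences_for_user(user_id: str, sentences: list[str]) -> list[str]:
    """Assumed rule: a user may only record sentences from their own bucket."""
    bucket = assign_bucket(user_id)
    return [s for s in sentences if assign_bucket(s) == bucket]
```

Under these assumptions, a user who lands in a small bucket sees only a small share of the full sentence list, so they can run out long before the language as a whole does.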

Maybe the UI should suggest this once the user hits a certain number of recordings? There are many users with thousands of recordings, and some languages seem to have too few validators.

Yes, we are aware of that, and we plan to make UI changes to steer people in the right direction and set up proper individual and language goals :)

OK, I see what you mean by saying that 250 clips is enough, but I think this only applies to English, since it has more people contributing to it. For Kabyle, for example, I think each speaker has recorded 21 minutes on average. Sure, diversity in the dataset would be the primary thing to look into, but again, some languages may not have many contributors. I did some PR on my channel about Common Voice. I’m focusing on Spanish this month. Thanks for the reply.
Hope y’all have a good day.

From the data we have, with fewer than 1000 different speakers the quality of the speech recognition is badly degraded. That’s why having at least 1000 speakers is as important as having a minimum of 2000 hours.

Is the DeepSpeech team testing German and French? I’d like to see some info on that.

The best place to ask is #deep-speech :)

I also started with a lot of validating, but in German we ran out of recordings to validate every day, so I switched to more recording, and now I can’t contribute anymore. Besides, there are people with 4000 recordings.

I think the quality of the recordings is also important. You should read naturally but correctly.
Quality is better than quantity.