Dataset releases - What's more valuable for you?

Hello everyone,

I’m currently working on a proposal for the project strategy around dataset releases. As you might already know, we have done a couple of them, and since then we have had a lot of feedback and questions about them.

That’s why I’m opening this topic: to understand what would be most valuable for you and your communities/projects.

  • How often do you need dataset updates? Why?
  • Which languages are you most interested in? Why?
  • What brings you the most value from a dataset update?
  • Other considerations about these updates?

Your feedback will inform my proposal to the team; based on that, we will make a decision and share it back with you.


Note: I’ll keep this topic open for feedback for one week (until April 17th)

I think Spanish is one of the most spoken languages in the world and the dataset is not yet ready for download.
I do not know how often we need updates; I think it might depend on the changes the data has received.

I think once a month would be ideal, but I don’t know how hard it is to release a dataset. Maybe every 3 months would be good.
As for languages, I want to see Portuguese, Japanese and Korean launched. Even though I don’t speak Japanese or Korean, the dataset could be used for learning. I speak Portuguese, and we don’t have big speech recognition datasets like the ones English has (LibriVox, TED talks, etc.). As for the value the updates bring, I’d say more data, even though I see that only English, German and French seem to get a considerable amount of sentences each month. I was hoping Spanish would have at least 50 hours by now, but I see that the community is still new.
For me, the release should be on the first day of the month, so everyone knows when to get the new release. I think I’ve said this before, but the structure of the dataset should be more developer-friendly: a multilingual version with a nice tree structure would be cool, with not so many files per folder, maybe ten thousand clips per folder. I also don’t like tar.gz files, since you have to extract twice, at least on Windows; a .zip release would be nice.
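
To make the folder idea concrete, here is a minimal sketch of how clips could be spread over sub-folders of at most ten thousand files each. Everything in it is an assumption for illustration: the function name `shard_path`, the `<language>/clips/<shard>/` layout and the file names are hypothetical and do not describe how the dataset is actually packaged.

```python
from pathlib import PurePosixPath

# The "ten thousand clips per folder" figure suggested above.
CLIPS_PER_FOLDER = 10_000


def shard_path(language: str, clip_index: int, clip_name: str,
               per_folder: int = CLIPS_PER_FOLDER) -> PurePosixPath:
    """Return a relative path such as 'pt/clips/0001/<clip_name>' (hypothetical layout)."""
    if clip_index < 0:
        raise ValueError("clip_index must be non-negative")
    # Clips 0..9999 go to shard 0, clips 10000..19999 to shard 1, and so on.
    shard = clip_index // per_folder
    return PurePosixPath(language) / "clips" / f"{shard:04d}" / clip_name


print(shard_path("pt", 12_345, "sample_012345.mp3"))
```

With a layout like this, a multilingual release is just one top-level folder per language, and no single folder ever holds more than `per_folder` clips.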

That’s all. Thanks in advance.

Hello @nukeador

Not too often; we need a really good amount of data to notice an improvement.

Spanish. Even a small amount of data would be enough to test and establish common ground; then people will start sharing experiments on a common test set. I was talking with @daniel.cruzado, and it was really hard to work out what actually improves training. @daniel.cruzado is getting a better WER than me, and we don’t really know whether that is just because my test set is bad. I think releasing a small test set would be a good meeting point for people who are running scattered tests on their own test sets.


Not too often; we need a really good amount of data to notice an improvement.

+1. Except maybe for very small datasets where small increases are always better than none, I don’t think it’s worth publishing an update if the data aren’t like 20% bigger¹ than what was previously released.

I mean, if it’s no trouble for you it’s always good to release an update, but it doesn’t make a significant difference as far as training models goes if the data didn’t grow much. So, I believe the choice to release more often than that should mostly depend on how convenient releasing new data is for you.

¹ NB: I’ve been very conservative with that “random” percentage as I don’t do ML for speech recognition. But I do ML for image processing, and last time we bothered updating our model with a new source database, the increase was in the order of x20 (+2000%). Although we would have done it with a lower increase too, I really don’t think we would have tried with anything lower than +50%.
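
The rule of thumb above can be written down as a small check. This is only a sketch under stated assumptions: the function name `worth_releasing` is made up, the 20% default comes from the post, and the 10-hour cutoff for a "very small" dataset is an arbitrary placeholder that nobody in this thread proposed.

```python
def worth_releasing(previous_hours: float, current_hours: float,
                    min_growth: float = 0.20,
                    small_dataset_hours: float = 10.0) -> bool:
    """Decide whether a new release looks worthwhile under the rule of thumb above.

    min_growth is the relative growth threshold (0.20 means 20% bigger).
    small_dataset_hours is a hypothetical cutoff below which any growth counts.
    """
    if current_hours <= previous_hours:
        return False  # nothing new to publish
    if previous_hours <= 0 or previous_hours < small_dataset_hours:
        return True  # very small datasets: a small increase is better than none
    growth = (current_hours - previous_hours) / previous_hours
    return growth >= min_growth


print(worth_releasing(100, 115))  # 15% growth on a large dataset
print(worth_releasing(100, 125))  # 25% growth on a large dataset
print(worth_releasing(1, 2))      # tiny dataset that doubled
```

Both thresholds are parameters on purpose, since the right values depend on how convenient a release is to produce.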


(feel free to split these off to a separate topic if you feel that’s better)

Since most of us are probably “data-hungry” and just for a better understanding:

What steps does it actually take to do a release?

  • one could have the impression that it could be nearly automated with the up and down votes based on the listening data
  • but it seems it’s not
  • so what needs to be done manually?
  • and why do these steps have to be done manually at present (perhaps the community can think of ways to get around that)?
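
For illustration, the "nearly automated" selection imagined in the list above might look like the sketch below. The `Clip` fields, the function names and the voting rule (at least two up votes, and more up votes than down votes) are all hypothetical; this thread does not say what rule the project really applies, and it also reports that vote-based validation alone still lets bad clips through, so this would be a first pass at best.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Clip:
    path: str
    up_votes: int
    down_votes: int


def is_validated(clip: Clip, min_up_votes: int = 2) -> bool:
    """Hypothetical rule: enough up votes, and more up votes than down votes."""
    return clip.up_votes >= min_up_votes and clip.up_votes > clip.down_votes


def select_for_release(clips: List[Clip]) -> List[Clip]:
    """Keep only the clips that pass the vote-based check."""
    return [clip for clip in clips if is_validated(clip)]


clips = [
    Clip("a.mp3", up_votes=2, down_votes=0),  # passes
    Clip("b.mp3", up_votes=1, down_votes=0),  # too few up votes
    Clip("c.mp3", up_votes=2, down_votes=3),  # more down votes than up votes
]
print([clip.path for clip in select_for_release(clips)])
```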

Another question: most projects seem to have “nightlies” or “alphas/betas”. In other words, why isn’t the “raw” data available between releases?

  • So with nightlies you would be “on your own”, but have the max of data available.
  • Releases are more curated, but you would have to wait for them.

I asked about this before and was told that a lot of manual validation is required. It seems that even with the validation that occurs on the Common Voice site, it’s not enough to prevent bad clips slipping through.

Thanks, everyone, for your feedback. Based on it, I’ll write a proposal for the team to review and share it back here for a final iteration.



For new languages with fewer hours, I suggest a monthly update, because during that time there can be a significant change in the database (for example, 1 hour was recorded and a month later it is 2-3 hours).
I agree that you can make an update only if the changes were more than 20%.

Today we released a new version of the dataset, and we keep improving the automation of the process.