Good morning (or evening, or afternoon) to the delightful Common Voice community of contributors, dataset users and folks hanging out with us to learn more about language and technology.
It’s one of my favorite times of the year: time for another dataset release!
We now have a total of 112 languages live and we would be so excited to welcome more. Please shout if you have any questions or suggestions, or want to celebrate along!
So many thanks to the sentence, voice and technical contributors who made this possible.
I have a question about the dataset delta releases. The English delta 14 should contain 71 recorded and 56 validated hours. However, the validated.tsv only contains 4035 entries. That is only about 9% of the 44035 clips. It feels like I’m missing a lot of data if I only add validated clips with each delta release.
I checked the en delta, and there are 44035 .mp3 files, and this matches the sums of records in validated/invalidated/other buckets. Many of the recordings are in other, i.e. recorded but not validated yet…
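For anyone who wants to repeat this check on their own download, here is a minimal sketch of how it could be done. It assumes the extracted delta has validated.tsv, invalidated.tsv and other.tsv at the top level, each with one header row, and the .mp3 files in a `clips` subfolder; the folder name and the function names are my own assumptions for illustration, so adjust them to what you actually see on disk.

```python
import csv
from pathlib import Path

# Bucket files mentioned in this thread; each is assumed to have one header row.
BUCKETS = ("validated", "invalidated", "other")


def count_rows(tsv_path):
    """Count data rows in a TSV file, not counting the header line."""
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE so that quotation marks inside sentences are not
        # interpreted as CSV quoting.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        next(reader, None)  # skip the header
        return sum(1 for row in reader if row)


def bucket_report(dataset_dir, clips_subdir="clips"):
    """Return ({bucket: row count}, number of .mp3 files).

    `clips_subdir` is an assumed folder name, not something confirmed here.
    A missing bucket file is counted as 0 rows.
    """
    root = Path(dataset_dir)
    counts = {}
    for bucket in BUCKETS:
        tsv = root / f"{bucket}.tsv"
        counts[bucket] = count_rows(tsv) if tsv.exists() else 0
    mp3_total = sum(1 for _ in (root / clips_subdir).glob("*.mp3"))
    return counts, mp3_total
```

Calling `bucket_report("path/to/extracted/delta")` gives the per-bucket row counts and the number of clips; if the buckets are built correctly, `sum(counts.values())` should equal the .mp3 count, and `counts["validated"] / mp3_total` gives the validated share.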
If it is only the “validated hours” figure on the download page and in the metadata that is incorrect, then you will not lose any data by using the validated bucket. If, however, the buckets themselves are not created correctly, then we are all in trouble.