Multi-language Dataset Beta Release

What languages can we expect to be released?
And for those that won’t be, what’s missing?

We are going to release every language that has data as of October 2018, which comes to 16 languages. You can see the full list of languages here.

Hello Everyone, the new dataset is back up and ready for use!


I imagine you’re getting tons of requests for access to the voice data right now. Is giving access a manual process (i.e., giving access only after review of each form submission by a human)?

If you would like voice access, you can fill out the form above; we send the download links by email at the beginning of each day. Each of the sentences you hear has been reviewed by 2 humans to ensure its correctness. Does that answer your question?

Thanks, that fully answers my question. I filled out the form above earlier today (shortly after your post) but haven’t received any email, so I was just wondering if it had to be approved first. Again, thanks for your work on this, and I’m looking forward to working with the data!

You should be getting your email shortly!

Hi @lsaunders,
I’ve found a broken audio file in the zh-TW dataset.
The clip is listed in other.tsv.

Thanks for the heads up @areyliu6!
@gweber Do you need any further information about the broken file? Let’s review it in the next sprint meeting.

I downloaded the kab dataset but I can’t find the transcript (sentences). I got only the audio files.
I’m going to train a first model on this dataset using DeepSpeech so we can demo it at an event we are organizing to show the importance of the project and recruit more recorders from Kabylia.

The sentences are in the clips.tsv file. If you want to get them split up by language, validity & bucket, you need to run the CorporaCreator on the file.

But when I downloaded the audio files, there was no clips.tsv file. Is this file only available as a separate download?

Question from some local community member about the data:

Are we only delivering the voices that have been verified multiple times on the site?
If yes, then what is the difference between the voices listed in valid.tsv and those in the other .tsv files (besides invalid)?

You can find that in the first post in this thread:

The clips.tsv file contains the number of votes each clip got, and the audio files are everything we have for this language up to this point (which for that release is 2018-12-19, I think). The valid.tsv file only contains clips which have at least 2 up-votes and more up-votes than down-votes.
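For anyone who wants to reproduce that rule on their own copy of clips.tsv, here is a minimal sketch in Python. It is not the CorporaCreator code, just the rule as described above. The column names `path`, `sentence`, `up_votes` and `down_votes` and the sample rows are assumptions for illustration, so check them against the header of your actual file.

```python
import csv
import io

# Assumed column names; check the header row of your own clips.tsv.
UP_COL = "up_votes"
DOWN_COL = "down_votes"


def is_valid(up_votes: int, down_votes: int) -> bool:
    """Rule from the post above: at least 2 up-votes and more up- than down-votes."""
    return up_votes >= 2 and up_votes > down_votes


def read_tsv(text: str) -> list:
    """Parse tab-separated text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def filter_valid(rows: list) -> list:
    """Keep only the rows whose vote counts satisfy the rule."""
    return [r for r in rows if is_valid(int(r[UP_COL]), int(r[DOWN_COL]))]


# Made-up sample data standing in for a real clips.tsv.
SAMPLE = (
    "path\tsentence\tup_votes\tdown_votes\n"
    "a.mp3\tfirst sentence\t2\t0\n"
    "b.mp3\tsecond sentence\t1\t0\n"
    "c.mp3\tthird sentence\t3\t3\n"
    "d.mp3\tfourth sentence\t3\t1\n"
)

valid_rows = filter_valid(read_tsv(SAMPLE))
print([r["path"] for r in valid_rows])
```

To run it on a real file, read the file contents with `open(..., encoding="utf-8", newline="")` and pass the text to `read_tsv`. Clips that fail the rule are not necessarily invalid; they may simply not have enough votes yet.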

Thanks @lsaunders @gweber

Today we released a new version of the dataset, and we keep improving the automation of the process.