Hi, I wanted to offer a suggestion that I think would be useful for consumers of the dataset. In my case, I’d like to download only a few samples of the dataset to test some things out. In many other cases, I could see people wanting to get their code working, or to experiment with just a small portion of the dataset, without committing to a 48 GB download.
I think it would be super helpful to offer a tiny download that contains something like a random sample of 100 recordings, packaged in the same exact format as the full download. Perhaps anyone else could chime in if they think that’d be useful or if it’s not necessary.
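For anyone who already has the full download and wants to build such a subset themselves, here is a minimal sketch of the sampling step. It assumes you have a list of clip file names; the function name `pick_sample` and the fixed seed are my own choices for illustration, not anything the dataset provides.

```python
import random


def pick_sample(clip_names, k=100, seed=0):
    """Return up to k clip names chosen at random, reproducibly.

    `clip_names` is any iterable of file names (hypothetical input;
    how you obtain it depends on how you unpacked the dataset).
    """
    rng = random.Random(seed)      # private RNG so the global state is untouched
    names = sorted(clip_names)     # sort first so the result does not depend on input order
    return rng.sample(names, min(k, len(names)))


if __name__ == "__main__":
    clips = [f"clip_{i}.mp3" for i in range(1000)]  # made-up names for the demo
    subset = pick_sample(clips)
    print(len(subset), "clips selected")
```

Copying the selected clips and the matching metadata rows into the same folder layout as the full release would then give a small package in the same format.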
Well, the point is you could listen to a sample to hear the format of the dataset and begin building your entire data pipeline without waiting for the full download. I suppose in an ideal world there would be some selector so you could choose to download just, let’s say, 1 GB of the data. With that you could even perform accuracy tests or use it for non-STT applications.
One thing I would really like to download separately from the big database is the data tables, like validated.tsv and training.tsv. I often use them to get information such as how big the training dataset is compared to the complete dataset, or how much the dataset would shrink if I deleted sentences with one downvote. Plus, I often switch PCs, so a way to download only the metadata would be very helpful.
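To illustrate the kind of check described above, here is a rough sketch that counts how many rows of a metadata table would remain after dropping downvoted sentences. It assumes a tab-separated file with a `down_votes` column; the function name `downvote_summary` and the tiny inline table are made up for the example and are not real dataset rows.

```python
import csv
import io


def downvote_summary(tsv_file, max_down_votes=0):
    """Return (total_rows, rows_kept), keeping rows with at most max_down_votes.

    Assumes a tab-separated table with a `down_votes` column (an assumption
    about the metadata layout); empty or missing values count as 0.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    total = kept = 0
    for row in reader:
        total += 1
        if int(row.get("down_votes") or 0) <= max_down_votes:
            kept += 1
    return total, kept


# Made-up stand-in for a metadata table, only to exercise the function.
SAMPLE_TSV = (
    "path\tsentence\tup_votes\tdown_votes\n"
    "a.mp3\tHello there.\t2\t0\n"
    "b.mp3\tGood morning.\t3\t1\n"
    "c.mp3\tSee you soon.\t2\t0\n"
)

if __name__ == "__main__":
    total, kept = downvote_summary(io.StringIO(SAMPLE_TSV))
    print(f"{kept} of {total} rows have no downvotes")
```

With a real file you would pass something like `open("validated.tsv", encoding="utf-8", newline="")` instead of the `io.StringIO` object.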
nukeador
(Rubén Martín [❌ taking a break from Mozilla])
A note on this: we are already discussing how to allow more granular access to our dataset, but we haven’t started any work on it yet.
Is there anywhere the TSV files are hosted besides the main download link? Every time there’s a new release (in this case 6.1), I’d like to review the metadata and do some EDA on the files/speakers/sentences, but I have to download 56 GB of audio to do so! Thank you.