Suggestion: Offer a downloadable sample of the dataset

Hi, I wanted to offer a suggestion that I think would be useful for consumers of the dataset. In my case, I’d like to download only a few samples of the dataset to test some things out. More generally, I could see people wanting to get their code working or experiment with a small portion of the dataset without committing to a 48 GB download.

I think it would be super helpful to offer a tiny download that contains something like a random sample of 100 recordings, packaged in exactly the same format as the full download. Perhaps others could chime in on whether they think that’d be useful or whether it’s not necessary.

What could you do with 100 recordings?

A potentially more interesting subset would be one sample per sentence or per speaker.
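For what it’s worth, picking one clip per speaker or per sentence is easy to sketch once you have a metadata table. The snippet below is only a rough illustration: it assumes a tab-separated file with `client_id`, `path` and `sentence` columns, and `one_clip_per_key` is a made-up helper name, not anything the project ships.

```python
import csv
import io


def one_clip_per_key(tsv_file, key="client_id"):
    """Return the first metadata row seen for each distinct value of `key`."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    first_seen = {}
    for row in reader:
        # setdefault keeps the first row for a key and ignores later ones.
        first_seen.setdefault(row[key], row)
    return list(first_seen.values())


# Tiny in-memory stand-in for a metadata table (assumed column names).
demo = io.StringIO(
    "client_id\tpath\tsentence\n"
    "a\t1.mp3\tHello\n"
    "a\t2.mp3\tWorld\n"
    "b\t3.mp3\tHello\n"
)
subset = one_clip_per_key(demo)
print([row["path"] for row in subset])  # ['1.mp3', '3.mp3']
```

Passing `key="sentence"` instead gives one clip per sentence. Taking the first row is an arbitrary choice; a random pick per group would work just as well.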

Well, the point is that you could listen to a sample to hear the format of the dataset and begin building your entire data pipeline without waiting for the full download. I suppose in an ideal world there would be some selector so you could choose to download just, let’s say, 1 GB of the data. With that you could even run accuracy tests or use it for non-STT applications.

You could get a representative idea from validating clips on the site if that’s what you’re looking for as well.

I guess I’m missing what you’re trying to accomplish with so little data.

One thing I would really like to download separately from the big database is the data tables like validated.tsv and training.tsv. I often use them to get information such as how big the training dataset is compared to the complete dataset, or how much the dataset would shrink if I deleted sentences with one downvote. Plus I often switch PCs, so a way to download only the metadata would be very helpful.
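To give an idea of how little is needed for those checks, here is a minimal sketch using only the metadata tables. It assumes the tables are tab-separated with a `down_votes` column, and it treats the validated table as the "complete" dataset, which may not match what you mean; `read_rows` and `metadata_stats` are hypothetical helper names.

```python
import csv
import io


def read_rows(tsv_file):
    """Load a tab-separated metadata table into a list of dicts."""
    return list(csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE))


def metadata_stats(validated_rows, train_rows):
    """Compare the training split to the validated set and count downvoted rows."""
    total = len(validated_rows)
    if total == 0:
        raise ValueError("validated table is empty")
    downvoted = sum(1 for row in validated_rows if int(row["down_votes"]) >= 1)
    return {
        "train_share": len(train_rows) / total,
        # Fraction of rows that would disappear if downvoted rows were dropped.
        "downvoted_share": downvoted / total,
    }


# In-memory stand-ins for the two tables (assumed column names).
validated = read_rows(io.StringIO(
    "path\tdown_votes\n1.mp3\t0\n2.mp3\t1\n3.mp3\t0\n4.mp3\t0\n"
))
train = read_rows(io.StringIO("path\tdown_votes\n1.mp3\t0\n3.mp3\t0\n"))
stats = metadata_stats(validated, train)
print(stats)  # {'train_share': 0.5, 'downvoted_share': 0.25}
```

With real files you would pass `open(path, newline="", encoding="utf-8")` to `read_rows` instead of the `io.StringIO` objects.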


A note on this: we are already discussing how to allow more granular access to our dataset, but we haven’t started any work on it yet.

Check out this thread. Some data sets are too large

Is there anywhere the tsv files are hosted besides the main download link? Every time there’s a new release (in this case 6.1) I’d like to review the metadata and do some EDA on the files/speakers/sentences, but I have to download 56 GB of audio to do so! Thank you.
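This doesn’t avoid the download itself, but if the release is a tar archive you can at least pull out only the tsv files while reading it as a stream, without unpacking the audio to disk. A rough sketch with Python’s standard `tarfile` module follows; the archive layout in the demo is invented, and `extract_tsv_members` is a hypothetical helper.

```python
import io
import tarfile


def extract_tsv_members(archive_stream):
    """Return {member name: bytes} for every .tsv file in a tar stream."""
    found = {}
    # "r|*" reads the archive sequentially and detects compression itself.
    with tarfile.open(fileobj=archive_stream, mode="r|*") as archive:
        for member in archive:
            if member.isfile() and member.name.endswith(".tsv"):
                # In stream mode the data must be read before moving on.
                found[member.name] = archive.extractfile(member).read()
    return found


def _demo_archive():
    """Build a small in-memory .tar.gz with one table and one fake clip."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        files = [
            ("cv/en/validated.tsv", b"path\tsentence\n"),
            ("cv/en/clips/1.mp3", b"\x00" * 16),
        ]
        for name, payload in files:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


tables = extract_tsv_members(_demo_archive())
print(sorted(tables))  # ['cv/en/validated.tsv']
```

For a real archive you would pass a file opened in binary mode. The whole archive is still read from start to finish, so this saves disk space and unpacking time rather than bandwidth.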