Is it possible to download only validated recordings?

carente · October 13, 2023, 9:26am

Hi, great project! Kudos to you all!

I am looking at the datasets, and for some languages, the amount of validated recordings is very low compared to the total number of recorded hours.

Is it possible somehow to only download validated recordings?

In fact, should non-validated recordings be included in the datasets in the first place? Downloading 30 GB of recordings when one can (hopefully) use 7 GB with some confidence doesn’t seem very efficient or environment friendly.

bozden · October 13, 2023, 1:19pm

No, there is no way.

doesn’t seem very efficient or environment friendly

Good point for the environment

But invalidated recordings are useful for analyzing why clips were rejected, especially for language community leads. Some of them may actually be correct, and after processing them you can build your own splits that treat them as validated. You can also run an offline process (local validators) on the not-yet-validated ones (other.tsv) for your project.
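For anyone who lands here wanting a validated-only working copy: a minimal sketch, assuming the extracted dataset has a `validated.tsv` with a `path` column and a `clips/` directory next to it. The function names are made up for illustration and are not part of any Common Voice tooling. This only trims the data after download and extraction; it does not reduce the download itself.

```python
import csv
import shutil
from pathlib import Path


def validated_clip_names(tsv_path, column="path"):
    """Return the clip file names listed in a Common Voice-style TSV.

    Assumption: the TSV has a header row with a `path` column.
    """
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE: sentences may contain stray quote characters.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [row[column] for row in reader if row.get(column)]


def copy_validated(dataset_dir, out_dir):
    """Copy only the clips named in validated.tsv into out_dir.

    Assumption: audio files live in `<dataset_dir>/clips/`.
    """
    dataset_dir, out_dir = Path(dataset_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in validated_clip_names(dataset_dir / "validated.tsv"):
        src = dataset_dir / "clips" / name
        if src.is_file():  # skip rows whose audio file is missing
            shutil.copy2(src, out_dir / name)
            copied.append(name)
    return copied
```

Pointing `validated_clip_names` at `other.tsv` instead would give the list of clips to feed into a local validation pass, if that file uses the same `path` column.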

As the datasets are CC0, having the whole dataset released is the right way I think.

Maybe validating them for the next dataset release will help you and others

carente · October 13, 2023, 1:27pm

That’s too bad.

Of course full data should be available and can be useful. But I think it’d be even better to offer a download version with only validated data.

So that’s my suggestion if that is something you guys might consider at some point.

bozden · October 13, 2023, 1:30pm

I’m just another user like you @carente

Maybe sending a feature request on GitHub is the right way…

carente · October 13, 2023, 1:47pm

Got it, I thought you might be in the admin team for some reason.

Thanks!

gina · October 18, 2023, 10:34am

Hi Carente

We appreciate your suggestion, and we continuously work towards improving our data and user experience. To help us keep track and manage requests effectively, kindly submit your feature request on our GitHub repository. I’ve also taken note of your suggestion and will bring it to the attention of our team.

Thank you!

bozden · October 18, 2023, 1:34pm

Hey @gina, in this case, would you also add the following to your notes:

For people working on dataset quality/advancement - without training a model - only the .tsv files (along with clip durations) are needed. If the downloads can include a “tsv-only” file, that would be very helpful.
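To illustrate why the `.tsv` files alone are enough for this kind of work, here is a rough sketch that totals the validated hours without touching any audio. It assumes `clip_durations.tsv` has `clip` and `duration[ms]` columns and that `validated.tsv` has a `path` column; check the headers in your release, since these names are assumptions here. `validated_hours` is a hypothetical helper name.

```python
import csv


def read_tsv(path):
    """Read a tab-separated file with a header row into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE))


def validated_hours(validated_tsv, durations_tsv,
                    clip_col="clip", dur_col="duration[ms]"):
    """Sum the durations of all clips listed in validated.tsv, in hours.

    Assumption: durations are stored in milliseconds, keyed by clip name.
    Clips without a duration entry are counted as 0.
    """
    durations = {row[clip_col]: int(row[dur_col])
                 for row in read_tsv(durations_tsv)}
    total_ms = sum(durations.get(row["path"], 0)
                   for row in read_tsv(validated_tsv))
    return total_ms / 3_600_000  # milliseconds per hour
```

The same function works for `invalidated.tsv` or `other.tsv` if they share the `path` column, which makes it easy to compare how the recorded hours split across the buckets.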

gina · October 19, 2023, 5:18am

Hey Bülent

Thank you so much, I’ll add this to the notes.

kathyreid · October 22, 2023, 2:45am

I have raised this as a feature request on the cv-dataset repository, which is where I get the JSON-formatted data I use for visualisation.

github.com/common-voice/cv-dataset

FEATURE REQUEST: Make the `.tsv` files that are part of a downloaded dataset available separately

opened 02:43AM - 22 Oct 23 UTC

KathyReid

## User story

* As a researcher, I frequently create data visualisations based on the `validated.tsv` file of a language / release. Currently the only way to obtain this file is to download the _whole_ dataset or delta. I want to be able to get _just_ the `.tsv` files related to a release, without downloading the clips, so that I can do faster data visualisations.

## Acceptance criteria

* The files
  - `clip_durations.tsv`
  - `invalidated.tsv`
  - `other.tsv`
  - `reported.tsv`
  - `validated.tsv`

  are available
  - for each language in the CV corpus (about 103 at time of writing)
  - for each version
  - including delta releases

  from the [CV datasets download page](https://commonvoice.mozilla.org/en/datasets), in the same way as we currently download the `.tar.gz` formatted datasets.

gina · October 25, 2023, 11:31am

Thanks Kathy, I will flag this to the team.
