Is it possible to download only validated recordings?

Hi, great project! Kudos to you all!

I am looking at the datasets, and for some languages, the number of validated hours is very low compared to the total number of recorded hours.

Is it possible somehow to only download validated recordings?

In fact, should non-validated recordings be included in the datasets in the first place? Downloading 30 GB of recordings when one can (hopefully) use 7 GB with some confidence doesn’t seem very efficient or environment friendly.

No, there is no way.

“doesn’t seem very efficient or environment friendly”

Good point for the environment :+1:

But invalidated recordings are useful for analyzing why clips were rejected, especially for language community leads. Some of them may even be correct, and after processing them you can build your own splits and treat them as validated. You can also run an offline process (local validators) on the not-yet-validated ones (other.tsv) for your project.

Since the datasets are CC0, I think releasing the whole dataset is the right way to go.
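Until a validated-only download exists, one workaround after downloading is to keep only the clips listed in validated.tsv. Here is a minimal sketch, assuming the extracted release has a `validated.tsv` with a `path` column and a `clips/` folder next to it; the function name `validated_clip_paths` is made up for illustration, so please check the layout against your own download.

```python
import csv
from pathlib import Path


def validated_clip_paths(dataset_dir):
    """Return the paths of all clips listed in validated.tsv.

    Assumes (not confirmed in this thread) that the extracted release
    contains validated.tsv with a "path" column and a clips/ folder.
    """
    dataset_dir = Path(dataset_dir)
    tsv_path = dataset_dir / "validated.tsv"
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE: sentences may contain quote characters that
        # should not be interpreted as CSV quoting.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [dataset_dir / "clips" / row["path"] for row in reader]
```

Calling `validated_clip_paths("path/to/extracted/language")` gives you a list you can feed to your training pipeline, or use to copy only those files elsewhere. The same approach works with other.tsv if you want to run your own local validation on those clips.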

Maybe validating them for the next dataset release will help you and others :slight_smile:

That’s too bad.

Of course full data should be available and can be useful. But I think it’d be even better to offer a download version with only validated data.

So that’s my suggestion if that is something you guys might consider at some point. :slight_smile:

I’m just another user like you @carente :slight_smile:

Maybe opening a feature request on GitHub is the right way…

Got it, I thought you might be in the admin team for some reason.

Thanks!

Hi Carente

We appreciate your suggestion, and we continuously work towards improving our data and user experience. To help us keep track and manage requests effectively, kindly submit your feature request on our GitHub repository. I’ve also taken note of your suggestion and will bring it to the attention of our team.

Thank you!

Hey @gina, in this case, would you also add the following to your notes:

For people working on dataset quality/advancement - without training a model - only the .tsv files (along with clip durations) are needed. If the downloads can include a “tsv-only” file, that would be very helpful.
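To illustrate why the .tsv files alone are enough for this kind of work, here is a rough sketch that sums the duration of the clips in one split without touching any audio. It assumes a `clip_durations.tsv` file with `clip` and `duration[ms]` columns and a `path` column in the split file; those names and the function `total_hours` are assumptions for illustration, so adjust them to what your release actually contains.

```python
import csv
from pathlib import Path


def total_hours(dataset_dir, split="validated.tsv"):
    """Sum the durations (in hours) of the clips listed in a split file.

    Assumed layout: clip_durations.tsv with "clip" and "duration[ms]"
    columns, and a "path" column in the split file.
    """
    dataset_dir = Path(dataset_dir)

    # Map clip file name -> duration in milliseconds.
    durations = {}
    with open(dataset_dir / "clip_durations.tsv", newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            durations[row["clip"]] = int(row["duration[ms]"])

    # Add up the durations of the clips that appear in the split.
    total_ms = 0
    with open(dataset_dir / split, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Clips without a known duration count as zero.
            total_ms += durations.get(row["path"], 0)

    return total_ms / 3_600_000  # milliseconds per hour
```

For example, `total_hours("path/to/language", "other.tsv")` would estimate how many hours are still waiting for validation.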

Hey Bülent

Thank you so much, I’ll add this to the notes.

I have raised this as a feature request on the cv-datasets repository, which is where I get the JSON-formatted data I use for visualisation.

Thanks Kathy, I will flag this to the team.
