I am looking at the datasets, and for some languages the number of validated recordings is very low compared to the total number of recorded hours.
Is it possible somehow to only download validated recordings?
In fact, should non-validated recordings be included in the datasets in the first place? Downloading 30 GB of recordings when one can (hopefully) use 7 GB with some confidence doesn’t seem very efficient or environmentally friendly.
doesn’t seem very efficient or environmentally friendly
Good point about the environment.
However, invalidated recordings are useful for analyzing why they were rejected, especially for language community leads. Some of them may in fact be correct, and after processing them you can build your own splits that treat them as validated. You can also run an offline process (local validators) for your project on the clips that have not yet been decided either way (other.tsv).
Since the datasets are CC0, I think releasing the whole dataset is the right approach.
Validating them before the next dataset release may help you and others.
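To make the "build your own splits" idea a bit more concrete, here is a minimal sketch in Python. It assumes the TSV files have `path`, `up_votes` and `down_votes` columns. The function name `candidate_clips`, the `min_margin` threshold and the sample rows are made up for illustration, and a simple vote margin is only a stand-in for whatever local validator you actually trust.

```python
import csv
import io


def candidate_clips(tsv_file, min_margin=1):
    """Return paths of clips whose up-votes exceed down-votes by at least min_margin.

    Assumes a tab-separated file with `path`, `up_votes` and `down_votes`
    columns (an assumption about the release layout, check your own files).
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    selected = []
    for row in reader:
        up = int(row["up_votes"] or 0)      # treat empty cells as zero votes
        down = int(row["down_votes"] or 0)
        if up - down >= min_margin:
            selected.append(row["path"])
    return selected


# Tiny in-memory stand-in for other.tsv (made-up rows, not real data)
sample = io.StringIO(
    "path\tsentence\tup_votes\tdown_votes\n"
    "clip_1.mp3\tHello there\t1\t0\n"
    "clip_2.mp3\tGood morning\t0\t1\n"
    "clip_3.mp3\tSee you soon\t0\t0\n"
)
print(candidate_clips(sample))
```

With a real release you would pass `open("other.tsv", encoding="utf-8", newline="")` instead of the in-memory sample, and then listen to or automatically check the selected clips before adding them to a custom split.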
We appreciate your suggestion, and we continuously work towards improving our data and user experience. To help us keep track and manage requests effectively, kindly submit your feature request on our GitHub repository. I’ve also taken note of your suggestion and will bring it to the attention of our team.
Hey @gina, in this case, would you also include the following in your notes:
For people working on dataset quality/advancement - without training a model - only the .tsv files (along with clip durations) are needed. If the downloads can include a “tsv-only” file, that would be very helpful.
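As an illustration of what is possible with only the .tsv files, here is a rough Python sketch that totals the validated hours without touching any audio. It assumes a durations file with `clip` and `duration[ms]` columns and a validated list with a `path` column. Those column names, the function name `validated_hours` and the sample rows are assumptions for the example, so adjust them to your release.

```python
import csv
import io


def validated_hours(validated_tsv, durations_tsv):
    """Sum the durations of validated clips and return the total in hours.

    Assumes `durations_tsv` has `clip` and `duration[ms]` columns and
    `validated_tsv` has a `path` column (assumed layout, not guaranteed).
    """
    durations = {
        row["clip"]: int(row["duration[ms]"])
        for row in csv.DictReader(durations_tsv, delimiter="\t", quoting=csv.QUOTE_NONE)
    }
    total_ms = sum(
        durations.get(row["path"], 0)  # clips without a known duration count as 0
        for row in csv.DictReader(validated_tsv, delimiter="\t", quoting=csv.QUOTE_NONE)
    )
    return total_ms / 3_600_000  # milliseconds per hour


# Made-up in-memory stand-ins for the two files
validated = io.StringIO("path\tsentence\nclip_1.mp3\tHello there\nclip_2.mp3\tGood morning\n")
durations = io.StringIO(
    "clip\tduration[ms]\n"
    "clip_1.mp3\t1800000\n"
    "clip_2.mp3\t1800000\n"
    "clip_3.mp3\t3600000\n"
)
print(validated_hours(validated, durations))
```

Running the same sum over every file (validated, invalidated, other) gives the validated-to-total ratio mentioned at the top of the thread, which is exactly the kind of check a "tsv-only" download would make cheap.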