Dataset versions

I am new to the Mozilla Common Voice dataset and I have a question.
There are multiple versions of the dataset (Kyrgyz specifically), from 1 to 6.1. Is there overlap between versions? For example, are the clips from version 1 also in version 6.1, or does every version consist of separate clips?

Hi @whoever and welcome :wave:

Yes, generally all clips from a previous version are included in the next version. It’s best to always use the latest version.
I say “generally” because sentences can be removed. This can happen if sentences in the dataset came from a source with an incompatible license or, I think, if a user later requests that their own clips be removed.
It seems something like this happened between versions 2 and 3 of the Kyrgyz dataset, where the size went down from 508 MB to 502 MB.
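
If you want to check the overlap yourself, here is a rough sketch that compares the clip file names listed in two releases. It assumes each extracted release has a tab-separated `validated.tsv` with a `path` column, and that clip file names stay the same from one version to the next; I have not verified that for every release. The function names and the example paths are made up for illustration.

```python
import csv


def clip_paths(tsv_path):
    """Return the set of clip file names listed in a Common Voice TSV file.

    Assumes a tab-separated file with a header row containing a 'path' column.
    """
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE: sentences may contain quote characters that are not CSV quoting.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return {row["path"] for row in reader}


def compare_versions(old_tsv, new_tsv):
    """Count clips shared between two releases and clips unique to each one."""
    old, new = clip_paths(old_tsv), clip_paths(new_tsv)
    return {
        "shared": len(old & new),
        "only_in_old": len(old - new),
        "only_in_new": len(new - old),
    }


# Example call (hypothetical paths, adjust to where you extracted the archives):
# print(compare_versions("cv-2/ky/validated.tsv", "cv-3/ky/validated.tsv"))
```

A non-zero `only_in_old` count would point to clips that were dropped between the two releases, provided the file-name assumption holds.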

I hope that helps :slight_smile:


Thanks, it helped. One more thing, if you happen to know: on the languages page the Kyrgyz dataset has 37 hours of validated data, but on the datasets page there are only 11. Is there any way I can download the 37 hours of validated data? Or is it just that not everything has been approved by the community yet?
Thanks!

The datasets are released twice a year (June and December). Version 6.1 is from December 2020.
The information on the language page is the current size in the system, which will be included in the next release (June 2021).


Ohh, thanks a lot! Really appreciate the answers! Then I’ll wait for the next release.