Dataset versions

I am new to everything regarding the Mozilla Common Voice dataset and I have a question.
There are multiple versions of the dataset (Kyrgyz language specifically), from 1 to 6.1. Is there overlap between versions? For example, are the clips from version 1 also in version 6.1, or does every version consist of separate clips?

Hi @whoever and welcome :wave:

Yes, generally all clips from a previous version are included in the next version. It’s best to always use the latest version.
I say “generally” because sentences could be removed. This could happen if the dataset contains sentences that came from a source with an incompatible license or, I think, if a user asks for their own clips to be removed afterwards.
It seems something like this happened between versions 2 and 3 of the Kyrgyz dataset, where the size went down from 508 MB to 502 MB.
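
If you want to check the overlap yourself, here is a minimal Python sketch. It assumes you have extracted two releases, that each contains a validated.tsv file with a `path` column holding the clip filename, and that a clip keeps the same filename from one release to the next. The function names are my own and are not part of any Common Voice tooling.

```python
import csv


def clip_paths(tsv_file):
    """Return the set of clip filenames in the `path` column of a TSV file."""
    with open(tsv_file, newline="", encoding="utf-8") as f:
        # QUOTE_NONE: sentences may contain stray quote characters.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return {row["path"] for row in reader}


def compare_releases(old_tsv, new_tsv):
    """Count clips kept, removed, and added between two releases."""
    old = clip_paths(old_tsv)
    new = clip_paths(new_tsv)
    return {
        "kept": len(old & new),      # clips present in both releases
        "removed": len(old - new),   # clips only in the older release
        "added": len(new - old),     # clips only in the newer release
    }
```

Calling `compare_releases("v2/ky/validated.tsv", "v3/ky/validated.tsv")` (hypothetical paths, adjust them to wherever you unpacked the archives) returns the three counts. A non-zero "removed" count would be consistent with clips having been dropped between releases.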

I hope that helps :slight_smile:


Thanks, it helped. And, in case you know: on the languages page the Kyrgyz dataset has 37 hours of validated data, but on the datasets page there are only 11. Is there any way I can download the 37 hours of validated data? Or is it just that not everything has been approved by the community yet?

The datasets are released twice a year (June and December). Version 6.1 is from December 2020.
The figure on the languages page is the current amount of data in the system; it will be included in the next release (June 2021).

Ohh, thanks a lot! Really appreciate the answers! Then I’ll wait for the next release.