Dataset versions

whoever · June 12, 2021, 8:26am

I am new to everything regarding the Mozilla Common Voice dataset and I have a question.
There are multiple versions of the datasets (Kyrgyz specifically), from 1 to 6.1. Is there overlap between versions? For example, are clips from version 1 also in version 6.1, or does every version consist of separate clips?

mikoMK · June 12, 2021, 3:39pm

Hi @whoever and welcome

Yes, generally all clips from a previous version are included in the next version. It’s best to always use the latest version.
I say “generally” because sentences can be removed. This can happen if sentences in the dataset came from a source with an incompatible license or, I think, if a user asks for their own clips to be removed afterwards.
It seems something like this happened between version 2 and 3 of the Kyrgyz dataset where its size went down from 508 MB to 502 MB.
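
If you want to check the overlap yourself, here is a minimal sketch in Python. It assumes you have extracted two releases locally, that each contains a `validated.tsv` with a `path` column holding the clip filename, and that a clip keeps the same filename from one release to the next. The directory names in the usage comment are hypothetical, and `clip_paths` and `compare_versions` are just helper names made up for this example.

```python
import csv


def clip_paths(tsv_file):
    """Return the set of clip filenames listed in one TSV file.

    Assumes a tab-separated file with a header row containing a `path` column.
    """
    with open(tsv_file, newline="", encoding="utf-8") as f:
        # QUOTE_NONE so that quote characters inside sentences are kept as-is.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return {row["path"] for row in reader}


def compare_versions(old_tsv, new_tsv):
    """Count clips kept, removed, and added between two releases."""
    old = clip_paths(old_tsv)
    new = clip_paths(new_tsv)
    return {
        "kept": len(old & new),      # present in both releases
        "removed": len(old - new),   # only in the older release
        "added": len(new - old),     # only in the newer release
    }


# Usage (hypothetical local paths):
# print(compare_versions("cv-2/ky/validated.tsv", "cv-3/ky/validated.tsv"))
```

A non-zero `removed` count would point to the kind of removal described above.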

I hope that helps

whoever · June 12, 2021, 3:53pm

Thanks, it helped. And, if you happen to know: on the languages page the Kyrgyz dataset has 37 hours of validated data, but on the datasets page there are only 11. Is there any way I can download the 37 hours of validated data? Or is it just that not everything has been approved by the community?
Thanks!

mikoMK · June 12, 2021, 4:07pm

The datasets are released twice a year (June and December). Version 6.1 is from December 2020.

The figure on the languages page is the current size in the system; that data will be included in the next release (June 2021).

whoever · June 12, 2021, 4:12pm

Ohh, thanks a lot! Really appreciate the answers! Then I’ll wait for the next release.
