Older English dataset question

h_caulfield · May 19, 2020, 12:17am

Apologies if this has already been answered or if this is the wrong place to ask. I downloaded an earlier version of the English dataset in June 2018. It contains ~469h of audio, with the following structure:

/cv-invalid
/cv-other-dev
/cv-other-test
/cv-other-train
/cv-valid-dev
/cv-valid-test
/cv-valid-train

and audio as: sample-000000.mp3, sample-000001.mp3, sample-000002.mp3, sample-000003.mp3, etc. within each subdirectory.

Visiting CommonVoice recently, I see there’s a lot more en data than in 2018, but the current version has a different structure and audio file naming convention. Assuming the current version is a superset of my older version, I’m hoping to use the older version as a test set and the rest of the recent data as a training set.

To do this, I need to know which audio files in the current version overlap with my older one. Any chance there is a key for this? Any help would be appreciated.

nukeador · May 19, 2020, 10:39am

Does your dataset include the file describing each file and the corresponding sentence? It should include an id.

h_caulfield · May 19, 2020, 12:06pm

Unfortunately, there is no client ID in the .csv files in the subfolders of the 2018 set. The only columns are:

filename, text, up_votes, down_votes, age, gender, accent, duration
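
With no client ID and no filename key, one conservative workaround is to match on transcript text rather than on audio files. The sketch below is only an illustration under stated assumptions: the 2018 .csv is comma-delimited with the `text` column listed above, the current release ships a tab-separated file with `path` and `sentence` columns, and the helper names (`normalize`, `held_out_texts`, `filter_new_clips`) are hypothetical. It does not identify identical recordings. It only drops every newer clip whose sentence also appears in the older set, so it may exclude more than strictly necessary.

```python
import csv
import io
import string

# Translation table that deletes ASCII punctuation.
_PUNCT = str.maketrans("", "", string.punctuation)


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT).split())


def held_out_texts(old_file, text_column="text", delimiter=","):
    """Collect normalized transcripts from the older (2018-style) file.

    Assumption: the file has a header row containing `text_column`.
    """
    reader = csv.DictReader(old_file, delimiter=delimiter)
    return {normalize(row[text_column]) for row in reader}


def filter_new_clips(new_file, held_out, sentence_column="sentence"):
    """Keep rows of the newer TSV whose sentence is NOT in the held-out set.

    Assumption: the newer file is tab-separated with a `sentence` column.
    """
    reader = csv.DictReader(new_file, delimiter="\t")
    return [row for row in reader
            if normalize(row[sentence_column]) not in held_out]


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the real files (made-up rows).
    old = io.StringIO(
        "filename,text\n"
        "cv-valid-test/sample-000000.mp3,hello world\n"
    )
    new = io.StringIO(
        "client_id\tpath\tsentence\n"
        "abc\tclip_1.mp3\tHello, world!\n"
        "abc\tclip_2.mp3\tA different sentence.\n"
    )
    kept = filter_new_clips(new, held_out_texts(old))
    print([row["path"] for row in kept])
```

Matching on text avoids sentence leakage between test and training data, but the same speaker could still appear on both sides, since the older files carry no speaker identifier.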

nukeador · May 19, 2020, 12:33pm

Let me check with the team in case someone knows better about the 2018 dataset.

nukeador · May 20, 2020, 12:07pm

@mhenretty do you remember how we generated this at the time?

The team suspects IDs might just be continuous numbers but we are not sure.

h_caulfield · May 20, 2020, 7:16pm

I’ll keep my fingers crossed someone remembers but I can always use a different training set instead.

Thanks for checking with the team, I really appreciate the data!

makoto_wada_jp · June 15, 2021, 10:49am

@h_caulfield, @nukeador any information regarding the following will be greatly appreciated:

Is it possible to download this dataset? It seems that it is no longer available from Common Voice Datasets. Although Common Voice Corpus 1 (2019-02-25) is available for download, it differs from your version, since its file names do not follow the sample_#+.mp3 convention of your version.
Any luck with finding the overlaps now that the file naming convention has changed? Just curious as to whether this has been resolved.

Lastly, although not related to this topic, I wish Mozilla Discourse supported an “Accepted Answer” feature like Stack Overflow’s, so that we could tell whether a question has been resolved. I do not see such a feature mentioned in Who has which moderation powers on Discourse?. I guess Mozilla Discourse is different from a Q&A type of community such as Stack Overflow.
