Older English dataset question

Apologies if this has already been answered; I'm also not sure this is the right place to ask. I downloaded an earlier version of the English dataset in June 2018. It contains ~469h of audio, with the following structure:

/cv-invalid
/cv-other-dev
/cv-other-test
/cv-other-train
/cv-valid-dev
/cv-valid-test
/cv-valid-train

and audio as: sample-000000.mp3, sample-000001.mp3, sample-000002.mp3, sample-000003.mp3, etc. within each subdirectory.

Visiting Common Voice recently, I see there's a lot more English data than in 2018, but the current version has a different structure and audio file naming convention. Assuming the current version is a superset of my older one, I'm hoping to use the older version as a test set and the rest of the current version (everything not in my older copy) as a training set.

To do this, I need to know which audio files in the current version overlap with my older copy. Is there any chance a key exists that maps the old file names to the new ones? Any help would be appreciated.
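In the absence of such a key, one thing that could be tried is comparing the audio files by content. The following is only a minimal sketch, and it rests on an unconfirmed assumption: that clips shared between the two releases are byte-for-byte identical. If the audio was re-encoded or re-tagged between releases, no matches will be found even where the recordings are the same. The function names and directory arguments are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def index_by_digest(root, pattern="*.mp3"):
    """Map each content digest to the files under `root` that have it."""
    index = defaultdict(list)
    for path in sorted(Path(root).rglob(pattern)):
        index[file_digest(path)].append(path)
    return index


def find_identical_clips(old_root, new_root):
    """Return (old_path, new_path) pairs whose bytes are identical.

    ASSUMPTION: shared clips were not re-encoded between releases.
    """
    old_index = index_by_digest(old_root)
    new_index = index_by_digest(new_root)
    pairs = []
    for digest, old_paths in old_index.items():
        for new_path in new_index.get(digest, []):
            for old_path in old_paths:
                pairs.append((old_path, new_path))
    return pairs
```

Calling `find_identical_clips("old_2018_dir", "new_release_dir")` (both paths are placeholders) would list candidate overlaps. An empty result does not prove the sets are disjoint; it may only mean the files were re-encoded.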

Does your dataset include the file describing each file and the corresponding sentence? It should include an id.

Unfortunately, no client ID in the .csv files in subfolders of the 2018 set. The only columns are:

filename text up_votes down_votes age gender accent duration
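Since the 2018 files carry the transcript in the `text` column, a conservative workaround that needs no ID at all is to drop from the new training data every clip whose sentence also appears in the old set. The sketch below makes several assumptions: that the old files are comma-separated with a header row, that the newer release ships a tab-separated file with a `sentence` column, and that a simple lowercase/punctuation-stripping normalization is enough to match transcripts. A matching sentence does not prove it is the same recording, because many speakers may read the same sentence, so this removes more than the true overlap.

```python
import csv


def normalize(text):
    """Lowercase and replace every non-alphanumeric character with a space.

    Note: apostrophes are also replaced, so "don't" becomes "don t";
    this is applied to both sides, so matching stays consistent.
    """
    kept = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(kept.split())


def load_old_sentences(csv_paths, text_column="text"):
    """Collect normalized transcripts from the 2018-style CSV files."""
    sentences = set()
    for path in csv_paths:
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                sentences.add(normalize(row[text_column]))
    return sentences


def split_new_rows(tsv_path, old_sentences, sentence_column="sentence"):
    """Split rows of a newer TSV into (safe, overlapping) lists.

    ASSUMPTION: the newer file is tab-separated and has a column
    named `sentence_column`; adjust to the actual header.
    """
    safe, overlapping = [], []
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if normalize(row[sentence_column]) in old_sentences:
                overlapping.append(row)
            else:
                safe.append(row)
    return safe, overlapping
```

Training only on the `safe` rows keeps test sentences out of the training set, at the cost of discarding some clips that were never in the 2018 download. It does not address speaker overlap, which would require a speaker identifier on both sides.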

Let me check with the team in case someone knows better about the 2018 dataset.

@mhenretty do you remember how we generated this at the time?

The team suspects IDs might just be continuous numbers but we are not sure.

I’ll keep my fingers crossed someone remembers but I can always use a different training set instead.

Thanks for checking with the team, I really appreciate the data!

@h_caulfield, @nukeador any information regarding the following will be greatly appreciated:

  • Is it possible to download this dataset? It seems that it is no longer available from Common Voice Datasets. Common Voice Corpus 1 (2019-02-25) is available for download, but it differs from your version: its file names do not follow the sample-000000.mp3 naming convention of your version.

  • Any luck with finding the overlaps now that the file naming convention has changed? I'm curious whether this has been resolved.

Lastly, although not related to this topic, I wish the Mozilla Discourse supported an "Accepted Answer" feature like Stack Overflow's, so that we could tell whether a question has been resolved. I do not see such a feature mentioned in "Who has which moderation powers on Discourse?". I guess Mozilla Discourse is different from a Q&A-type community such as Stack Overflow.