Older English dataset question

Apologies if this has already been answered, and I’m not sure where this question belongs. I downloaded an earlier version of the English dataset in June 2018. It contains ~469h of audio, with the following structure:

/cv-invalid
/cv-other-dev
/cv-other-test
/cv-other-train
/cv-valid-dev
/cv-valid-test
/cv-valid-train

The audio files within each subdirectory are named sample-000000.mp3, sample-000001.mp3, sample-000002.mp3, sample-000003.mp3, etc.

Visiting CommonVoice recently, I see there’s a lot more English data than in 2018, but the current version has a different structure and audio file naming convention. Assuming the current version is a superset of my older one, I’m hoping to use the older set as a test set and the rest of the recent release as a training set.

To do this, I need to know which audio files in the current version overlap with my older set. Is there any chance a key for this exists? Any help would be appreciated.

Does your dataset include the file describing each file and the corresponding sentence? It should include an id.

Unfortunately, there is no client ID in the .csv files in the subfolders of the 2018 set. The only columns are:

filename text up_votes down_votes age gender accent duration
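
With no ID column to join on, one possible fallback (not a real key) might be to match on the transcript text instead. Below is a rough Python sketch of that idea. It assumes the 2018 .csv files are comma-delimited with a header row, and that the newer release ships a tab-separated file with a `sentence` column; that column name, the helper names, and the sample rows are all assumptions for illustration only.

```python
import csv
import io
import string


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.lower().translate(table).split())


def old_transcripts(old_csv_file):
    """Collect normalized transcripts from a 2018-style CSV (`text` column)."""
    return {normalize(row["text"]) for row in csv.DictReader(old_csv_file)}


def split_new_clips(new_tsv_file, seen, text_column="sentence"):
    """Split newer-release rows into (overlapping, unseen) by transcript.

    `text_column` is an assumption about the newer file's header.
    """
    overlapping, unseen = [], []
    for row in csv.DictReader(new_tsv_file, delimiter="\t"):
        target = overlapping if normalize(row[text_column]) in seen else unseen
        target.append(row)
    return overlapping, unseen


# Tiny in-memory stand-ins for the real files (made-up rows).
old_csv = io.StringIO(
    "filename,text,up_votes,down_votes,age,gender,accent,duration\n"
    "sample-000000.mp3,hello world,1,0,,,,\n"
)
new_tsv = io.StringIO(
    "path\tsentence\n"
    "clip_a.mp3\tHello, world!\n"
    "clip_b.mp3\tA different sentence.\n"
)

seen = old_transcripts(old_csv)
overlapping, unseen = split_new_clips(new_tsv, seen)
print([r["path"] for r in overlapping], [r["path"] for r in unseen])
```

A matching transcript only means the same sentence was read, not that it is the same recording, since many speakers can read the same sentence. Training only on the `unseen` rows would therefore be a conservative filter rather than an exact reconstruction of the overlap.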

Let me check with the team in case someone knows better about the 2018 dataset.

@mhenretty do you remember how we generated this at the time?

The team suspects IDs might just be continuous numbers but we are not sure.

I’ll keep my fingers crossed that someone remembers, but I can always use a different training set instead.

Thanks for checking with the team, I really appreciate the data!