Apologies if this has already been answered, and I’m not sure where this question belongs. I downloaded an earlier version of the English dataset in June 2018. It contains ~469h of audio, with the following structure:
/cv-invalid
/cv-other-dev
/cv-other-test
/cv-other-train
/cv-valid-dev
/cv-valid-test
/cv-valid-train
The audio files within each subdirectory are named sample-000000.mp3, sample-000001.mp3, sample-000002.mp3, sample-000003.mp3, etc.
Visiting CommonVoice recently, I see there’s a lot more English data than there was in 2018, but the current version has a different structure and audio file naming convention. Assuming the current version is a superset of my older one, I’m hoping to use the older version as a test set and the rest of the current version (its complement) as a training set.
To do this, I need to know which audio files in the current version overlap with my older version. Is there any chance a key exists that maps the old file names to the new ones? Any help would be appreciated.
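In case there is no such key, here is a rough sketch of the fallback I had in mind: compare transcript text between the two releases and keep out of training any new clip whose sentence also appears in my old set. This is only a sketch under assumptions. The column names ("text", "sentence", "path"), the delimiters, and the tiny inline sample data are all hypothetical stand-ins for whatever the real metadata files contain. A matching sentence also does not prove it is the same recording, since several speakers may have read the same prompt, so this would probably over-exclude rather than give an exact mapping.

```python
import csv
import io


def normalize(sentence):
    """Lowercase and collapse whitespace so trivial formatting differences don't block a match."""
    return " ".join(sentence.lower().split())


def load_sentences(handle, column, delimiter):
    """Read one text column from a CSV/TSV file handle into a set of normalized sentences."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return {normalize(row[column]) for row in reader if row.get(column)}


def split_new_rows(handle, column, delimiter, old_sentences):
    """Partition rows of the new release into (possible_overlap, safe_for_training)."""
    overlap, safe = [], []
    for row in csv.DictReader(handle, delimiter=delimiter):
        sentence = normalize(row.get(column) or "")
        target = overlap if sentence in old_sentences else safe
        target.append(row)
    return overlap, safe


# Hypothetical in-memory stand-ins for the old and new metadata files.
OLD_CSV = "filename,text\ncv-valid-test/sample-000000.mp3,Hello  world\n"
NEW_TSV = "path\tsentence\nclip_a.mp3\thello world\nclip_b.mp3\tSomething else\n"

old = load_sentences(io.StringIO(OLD_CSV), "text", ",")
overlap, safe = split_new_rows(io.StringIO(NEW_TSV), "sentence", "\t", old)

print("possible overlap:", [row["path"] for row in overlap])
print("safe for training:", [row["path"] for row in safe])
```

With real data the io.StringIO objects would be replaced by open(...) calls on the actual metadata files. A proper key from the maintainers would still be much better than this, since it would identify the exact recordings instead of just the shared sentences.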