2566 sound clips without data in the English dataset

After I finished building a tree index, where I move clips to another directory, I discovered that some clips were left behind. These clips are not mentioned in any of the .tsv files.

Is this an error in the data dump procedure?

cyberty /store/Download/common_voice/en # ls clips/ | wc -l
2566
cyberty /store/Download/common_voice/en # ls clips/ | sort | head
common_voice_en_20232338.mp3
common_voice_en_20232339.mp3
common_voice_en_20232340.mp3
common_voice_en_20232341.mp3
common_voice_en_20232342.mp3
common_voice_en_20232353.mp3
common_voice_en_20232355.mp3
common_voice_en_20232357.mp3
common_voice_en_20232359.mp3
common_voice_en_20232361.mp3
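
For anyone who wants to reproduce this kind of check without moving files, here is a minimal sketch. It assumes each .tsv file in the dataset directory has a `path` column holding the clip file name (with or without the `.mp3` extension). The function name `find_orphan_clips` and its parameters are hypothetical and not part of any Common Voice tooling.

```python
import csv
from pathlib import Path


def find_orphan_clips(dataset_dir, clips_subdir="clips", path_column="path"):
    """Return sorted names of clip files that no .tsv file references.

    Assumption: the .tsv files sit directly in dataset_dir and name
    clips in a column called `path_column`.
    """
    dataset_dir = Path(dataset_dir)
    referenced = set()

    # Collect every clip name mentioned in any .tsv file.
    for tsv_file in sorted(dataset_dir.glob("*.tsv")):
        with tsv_file.open(newline="", encoding="utf-8") as handle:
            reader = csv.DictReader(
                handle, delimiter="\t", quoting=csv.QUOTE_NONE
            )
            for row in reader:
                name = row.get(path_column)
                if name:
                    # Compare by stem so "x" and "x.mp3" both match.
                    referenced.add(Path(name).stem)

    # Anything in the clips directory that was never mentioned is an orphan.
    clips_dir = dataset_dir / clips_subdir
    return sorted(
        clip.name
        for clip in clips_dir.iterdir()
        if clip.is_file() and clip.stem not in referenced
    )
```

Calling `find_orphan_clips("/store/Download/common_voice/en")` on an untouched copy of the dataset should list the clips that no .tsv file mentions. .tsv files without a `path` column are simply skipped.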

Hi @isomorph70, thanks for flagging this. I’ll look into it. The current data dump procedure can be a little fragile given how big our dataset has gotten, and we’re working on improving that whole process.
