On the Common Voice Corpus 1 English page, the release date is 2019-02-25 (as shown in the screenshot below). However, I am trying to find a release from before 2019-02-25. According to this press release from Mozilla, there was at least one earlier release of Common Voice Corpus 1 English, on 2017-11-29. Could someone let me know where I can find this version, or any archived versions from before 2019-02-25?
What do you need it for? It could be that the version 1 release is just a re-release with a different date.
@fyters Thank you for your comments. As background, I was skimming through a 2018 paper, Audio Adversarial Examples: Targeted Attacks on Speech-to-Text, which mentions the use of Mozilla Common Voice, so I wondered exactly which data they are referring to.
Evaluation Benchmark. To evaluate the effectiveness of our attack, we construct targeted audio adversarial examples on the first 100 test instances of the Mozilla Common Voice dataset.
With that background in mind, I searched and found the post "Older English dataset question". I wish I had run into it earlier: there seems to be an older dataset with file names such as sample-000000.mp3, sample-000001.mp3, etc. These differ from the file names in Common Voice Corpus 1 English, which are 128 letters long followed by the .mp3 extension.
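To tell which generation a given clip name belongs to, a small helper like the one below could be used. It is only a sketch: the three patterns are inferred from the file names quoted in this thread, the labels are made up for illustration, and the "128 letters" names are assumed to be alphanumeric.

```python
import re

# Patterns inferred from file names quoted in this thread; they are
# assumptions, not an official Common Voice specification.
SCHEMES = [
    ("sample", re.compile(r"sample-\d+\.mp3")),            # older dataset
    ("hash128", re.compile(r"[0-9A-Za-z]{128}\.mp3")),     # Corpus 1 (assumed alphanumeric)
    ("common_voice_en", re.compile(r"common_voice_en_\d+\.mp3")),  # later releases
]

def naming_scheme(file_name: str) -> str:
    """Return the (hypothetical) label of the first pattern that matches the whole name."""
    for label, pattern in SCHEMES:
        if pattern.fullmatch(file_name):
            return label
    return "unknown"
```

For example, `naming_scheme("sample-000000.mp3")` returns `"sample"`, and any name that fits none of the three patterns returns `"unknown"`.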
In principle you should be able to take the first 100 lines of the recent release. I imagine that they are added in chronological order.
Each test/train/dev set is generated non-deterministically, meaning that they will vary from release to release even for minor updates. This is to avoid reproducing and perpetuating any demographic skews in each subsequent set.
So I compared validated.tsv of Common Voice Corpus 5.1 and Common Voice Corpus 6.1. Unfortunately, it seems they change between releases. I presume this is because the upvotes and downvotes for each audio file are tallied continuously. In other words, a clip listed in validated.tsv might later end up in invalidated.tsv, but I have not checked enough to know. Going back even further, the file names also change between pre-v1 (sample-#+.mp3), v1 (128 letters + .mp3), and v6 (common_voice_en_#+.mp3). This makes comparison very hard.
| File | v5.1 validated.tsv | v5.1 split | v6.1 validated.tsv | v6.1 split |
|---|---|---|---|---|
| common_voice_en_22215682.mp3 | line 12 | test.tsv line 12 | line 222,618 | train.tsv line 72,826 |
| common_voice_en_23730890.mp3 | DNE | DNE | line 18 | test.tsv line 18 |
| common_voice_en_22214552.mp3 | line 35 | test.tsv line 35 | line 60,972 | DNE |
| common_voice_en_19698345.mp3 | line 41 | test.tsv line 40 | line 13,479 | test.tsv line 9,735 |
| common_voice_en_19740330.mp3 | line 47 | test.tsv line 46 | line 3,730 | test.tsv line 3,158 |
DNE = Does Not Exist
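A comparison like the one above can be scripted roughly as follows. This is a minimal sketch, assuming each TSV has a header row with a `path` column holding the clip file name; `line_index` and `compare_positions` are hypothetical helper names, and line numbers here count the header as line 1.

```python
import csv

def line_index(tsv_path, column="path"):
    """Map each clip file name to its 1-based line number (the header is line 1)."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps one record per physical line, so line_num is the file line.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return {row[column]: reader.line_num for row in reader}

def compare_positions(old_tsv, new_tsv):
    """Return {file name: (line in old file, line in new file)}; None means DNE."""
    old, new = line_index(old_tsv), line_index(new_tsv)
    return {name: (old.get(name), new.get(name)) for name in old.keys() | new.keys()}
```

Calling `compare_positions` on the two releases' validated.tsv files (whatever their local paths are) gives, for every clip, where it sits in each release, with `None` standing in for DNE.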
@fyters Didn’t realize that you were an author of the paper. That is nice.
Regarding my following comment above:
I presume this is because the upvotes and downvotes for each audio file are tallied continuously. In other words, a clip listed in validated.tsv might later end up in invalidated.tsv, but I have not checked enough to know.
I compared Common Voice Corpus 5.1 (validated.tsv) against Common Voice Corpus 6.1 (invalidated.tsv and other.tsv) for English. Assuming that the dataset creators did not change the audio file names, here are the stats:
|v5.1\||moved to v6.1\|
with a comment from Common Voice README:
validated contains a list of all clips that have received two or more validations where up_votes > down_votes
invalidated contains a list of all clips that have received two or more validations where down_votes > up_votes, or clips that have received three or more validations where down_votes = up_votes
other contains a list of all clips that have not received sufficient validations to determine their status
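The three README rules above can be read as a small decision function. The sketch below is one interpretation, not the project's actual code: it assumes "validations" means the total number of votes (`up_votes + down_votes`), and `clip_status` is a hypothetical name.

```python
def clip_status(up_votes: int, down_votes: int) -> str:
    """Classify a clip by the README rules (one reading; 'validations' = total votes)."""
    total = up_votes + down_votes
    if total >= 2 and up_votes > down_votes:
        return "validated"
    if total >= 2 and down_votes > up_votes:
        return "invalidated"
    if total >= 3 and down_votes == up_votes:
        return "invalidated"
    # Not enough votes yet to decide either way.
    return "other"
```

Under this reading, a clip whose up_votes drops from 2 to 1 with no down votes (the down-vote count is an assumption) would go from `"validated"` to `"other"`.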
Furthermore, here are a few examples that moved from v5.1\validated.tsv to v6.1\other.tsv.
| File | v5.1\validated.tsv | v6.1\other.tsv | Change |
|---|---|---|---|
| common_voice_en_12136946.mp3 | line 964,803 | line 200 | up_votes: 2 -> 1 |
| common_voice_en_12308591.mp3 | line 964,806 | line 201 | up_votes: 2 -> 1 |
| common_voice_en_12309701.mp3 | line 964,810 | line 202 | up_votes: 2 -> 1 |
| common_voice_en_1324731.mp3 | line 18,620 | line 192 | up_votes: 2 -> 1 |