Looking for a Common Voice Corpus English release from before 2019-02-25 (v1)

On the Common Voice Corpus 1 English page, the release date is 2019-02-25 (as shown in the screenshot below). However, I am trying to find a release from before 2019-02-25. According to this press release from Mozilla, there was at least one earlier release of Common Voice Corpus 1 English, on 2017-11-29. Could someone let me know where I can find this version, or any archived version from before 2019-02-25?

What do you need it for? It could be that the version 1 release is just a re-release with a different date.

@ftyers Thank you for your comments. As background, I was skimming a 2018 paper, Audio Adversarial Examples: Targeted Attacks on Speech-to-Text, which mentions using the Mozilla Common Voice dataset, so I wondered exactly which data they are referring to.

Evaluation Benchmark. To evaluate the effectiveness of our attack, we construct targeted audio adversarial examples on the first 100 test instances of the Mozilla Common Voice dataset.

With that background, I searched and found the post "Older English dataset question". I wish I had run into it earlier: there seems to be an older dataset whose files are named sample-000000.mp3, sample-000001.mp3, etc. This differs from the file names in Common Voice Corpus 1 English, which are 128 characters long followed by the .mp3 extension:

  • 0000a0f45a2a9ca26455c76d7abfe5992806f8ad0f014a18616fb7dda86c508753765e61697993e5d2a0d9e2fab52a822b31ed5c3f7f3e5bc37495453f6b335f.mp3
  • 0000a1804c153bbb8cc5360a0b59a4818e7b4639e8948794af5eb2f725bf9c6219d4da66c0ee1bcd911295f87d33fab29165049095de65542efbb1165d33999f.mp3
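A minimal sketch for telling these naming schemes apart when sorting through old downloads. The patterns are inferred only from the file names quoted in this thread (the third one is the common_voice_en_ form seen in later releases), not from any official specification, and the labels and the `naming_scheme` helper are hypothetical.

```python
import re

# Patterns inferred from file names quoted in this thread; not an official spec.
SCHEMES = [
    ("pre-v1", re.compile(r"sample-\d+\.mp3")),
    ("v1", re.compile(r"[0-9a-f]{128}\.mp3")),
    ("v6-style", re.compile(r"common_voice_en_\d+\.mp3")),
]


def naming_scheme(file_name: str) -> str:
    """Return the label of the first pattern matching the whole file name."""
    for label, pattern in SCHEMES:
        if pattern.fullmatch(file_name):
            return label
    return "unknown"
```

Running this over a directory listing gives a quick hint about which generation of the dataset an archive came from, under the assumption that each release used a single scheme.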

In principle you should be able to take the first 100 lines of the recent release. I imagine that they are added in chronological order.
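Taking the first 100 rows could be sketched as below. This assumes the split file is tab-separated with a header row and a `path` column holding the clip file name; the `first_test_clips` helper is hypothetical, and whether the paper's authors selected their instances this way is not known.

```python
import csv
from itertools import islice


def first_test_clips(tsv_lines, n=100, column="path"):
    """Return the clip names from the first n data rows of a TSV split file.

    tsv_lines is any iterable of text lines, e.g. an open file handle.
    """
    # QUOTE_NONE: treat quote characters in transcripts as ordinary text.
    reader = csv.DictReader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row[column] for row in islice(reader, n)]


# Usage (the file name is an assumption):
# with open("test.tsv", newline="", encoding="utf-8") as f:
#     clips = first_test_clips(f)
```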

@ftyers I learned from their Common Voice README documentation that they change the content of these files …

Each test/train/dev set is generated non-deterministically, meaning that they will vary from release to release even for minor updates. This is to avoid reproducing and perpetuating any demographic skews in each subsequent set.

so I compared validated.tsv of Common Voice Corpus 5.1 and Common Voice Corpus 6.1. Unfortunately, the contents do change. I presume this is because the upvotes and downvotes for each audio file are tallied continuously. In other words, a clip in validated.tsv might later move to invalidated.tsv, but I have not checked enough to know. Going back even further, the file names also change between pre-v1 (sample-#+.mp3), v1 (128 characters + .mp3), and v6 (common_voice_en_#+.mp3). This makes the releases very hard to compare.

| audioFileName | v5.1 validated.tsv | v5.1 *.tsv | v6.1 validated.tsv | v6.1 *.tsv |
|---|---|---|---|---|
| common_voice_en_22215682.mp3 | line 12 | test.tsv line 12 | line 222,618 | train.tsv line 72,826 |
| common_voice_en_23730890.mp3 | DNE | DNE | line 18 | test.tsv line 18 |
| common_voice_en_22214552.mp3 | line 35 | test.tsv line 35 | line 60,972 | DNE |
| common_voice_en_19698345.mp3 | line 41 | test.tsv line 40 | line 13,479 | test.tsv line 9,735 |
| common_voice_en_19740330.mp3 | line 47 | test.tsv line 46 | line 3,730 | test.tsv line 3,158 |
...

DNE = Does Not Exist
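A lookup like the one behind this comparison can be sketched as follows. It assumes tab-separated files with a header row and a `path` column, and it counts the header as line 1 (the line-number convention used in the comparison above is not stated, so this is a guess). The `line_numbers` and `locate` helpers are hypothetical.

```python
import csv


def line_numbers(tsv_lines, column="path"):
    """Map each clip name to its line number, counting the header as line 1."""
    reader = csv.DictReader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[column]: n for n, row in enumerate(reader, start=2)}


def locate(clip, indexes):
    """Report where a clip sits in each indexed file, or DNE if it is absent."""
    return {
        name: (f"line {index[clip]:,}" if clip in index else "DNE")
        for name, index in indexes.items()
    }
```

Building one index per file (for example v5.1 validated.tsv, v5.1 test.tsv, v6.1 validated.tsv, v6.1 test.tsv) and calling `locate` for each clip yields one row of the comparison.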

That’s unfortunate. I suspect that the original release is not available without a substantial amount of effort (but @phirework can confirm). We had a similar issue with the Common Voice paper.

@ftyers I didn’t realize that you were an author of the paper. That is nice.

Regarding my comment above:

I presume this is because the upvotes and downvotes for each audio file are tallied continuously. In other words, a clip in validated.tsv might later move to invalidated.tsv, but I have not checked enough to know.

I compared Common Voice Corpus 5.1 (validated.tsv) against Common Voice Corpus 6.1 (invalidated.tsv and other.tsv) for English. Assuming that the dataset creators did not change the AudioFileName, here are the stats:

| Clips in v5.1\validated.tsv that moved | to v6.1\invalidated.tsv | to v6.1\other.tsv |
|---|---|---|
| 549 | 0 | 549 |
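A minimal sketch of this kind of comparison, assuming the clip names have already been read from each file (for instance from a `path` column, which is an assumption) and that names are stable across releases. The `moved_clips` helper is hypothetical.

```python
def moved_clips(old_validated, new_invalidated, new_other):
    """Find clips validated in the old release that sit in another file now."""
    old = set(old_validated)
    return {
        "invalidated.tsv": old & set(new_invalidated),
        "other.tsv": old & set(new_other),
    }
```

Taking `len()` of each returned set gives per-file counts of clips that left validated.tsv.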

with a comment from Common Voice README:

validated contains a list of all clips that have received two or more validations where up_votes > down_votes
invalidated contains a list of all clips that have received two or more validations where down_votes > up_votes, or clips that have received three or more validations where down_votes = up_votes
other contains a list of all clips that have not received sufficient validations to determine their status
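The three README rules above can be written out as a small function. This is a sketch of one reading of the quoted text, taking "validations" to mean the total number of votes (up_votes + down_votes); it is not the code Common Voice actually uses, and the `clip_status` helper is hypothetical.

```python
def clip_status(up_votes: int, down_votes: int) -> str:
    """Classify a clip per the README rules quoted above.

    Assumption: the number of validations is up_votes + down_votes.
    """
    total = up_votes + down_votes
    if total >= 2 and up_votes > down_votes:
        return "validated"
    if total >= 2 and down_votes > up_votes:
        return "invalidated"
    if total >= 3 and down_votes == up_votes:
        return "invalidated"
    return "other"
```

Under this reading, a clip with two up-votes and no down-votes is validated, while the same clip with a single up-vote has too few validations and falls into other.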

Furthermore, here are a few examples that moved from v5.1\validated.tsv to v6.1\other.tsv.

| AudioFileName | v5.1\validated.tsv | v6.1\other.tsv | Changes |
|---|---|---|---|
| common_voice_en_12136946.mp3 | line 964,803 | line 200 | up_votes: 2 -> 1 |
| common_voice_en_12308591.mp3 | line 964,806 | line 201 | up_votes: 2 -> 1 |
| common_voice_en_12309701.mp3 | line 964,810 | line 202 | up_votes: 2 -> 1 |
| common_voice_en_1324731.mp3 | line 18,620 | line 192 | up_votes: 2 -> 1 |
...
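The "Changes" entries can be produced by comparing the two rows for the same clip. A minimal sketch, assuming each row is a dict as returned by `csv.DictReader` and that the vote columns are named `up_votes` and `down_votes`; the `describe_changes` helper is hypothetical.

```python
def describe_changes(old_row, new_row, fields=("up_votes", "down_votes")):
    """List the fields whose values differ between two rows for one clip."""
    return [
        f"{name}: {old_row[name]} -> {new_row[name]}"
        for name in fields
        if old_row[name] != new_row[name]
    ]
```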