Looking for Common Voice Corpus English before 2019-02-25 (v1) release

makoto_wada_jp · June 14, 2021, 8:21am

On the Common Voice Corpus 1 English page, the release date is listed as 2019-02-25. However, I am trying to find a release from before 2019-02-25. According to this press release from Mozilla, there was at least one earlier release of Common Voice Corpus 1 English, on 2017-11-29. Could someone let me know where I can find that version, or any other archived version from before 2019-02-25?

ftyers · June 14, 2021, 1:43pm

What do you need it for? It could be that the version 1 release is just a re-release with a different date.

makoto_wada_jp · June 15, 2021, 10:37am

@ftyers Thank you for your comment. As background, I was skimming through a 2018 paper, Audio Adversarial Examples: Targeted Attacks on Speech-to-Text, which mentions using the Mozilla Common Voice dataset, so I wondered exactly which data they were referring to.

Evaluation Benchmark. To evaluate the effectiveness of our attack, we construct targeted audio adversarial examples on the first 100 test instances of the Mozilla Common Voice dataset.

With that background in mind, I searched and found the post Older English dataset question. I wish I had run into it earlier: there seems to be an older dataset whose files are named sample-000000.mp3, sample-000001.mp3, etc. This differs from the file names in Common Voice Corpus 1 English, which are 128 characters long followed by the .mp3 extension:

0000a0f45a2a9ca26455c76d7abfe5992806f8ad0f014a18616fb7dda86c508753765e61697993e5d2a0d9e2fab52a822b31ed5c3f7f3e5bc37495453f6b335f.mp3

0000a1804c153bbb8cc5360a0b59a4818e7b4639e8948794af5eb2f725bf9c6219d4da66c0ee1bcd911295f87d33fab29165049095de65542efbb1165d33999f.mp3

…
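For anyone who wants to check which naming scheme a given clip uses, here is a minimal sketch. The patterns are inferred only from the file names quoted in this thread (the third, common_voice_en_<digits>.mp3, is the form seen in the 5.1 and 6.1 releases discussed further down), and the function name `naming_scheme` and its labels are hypothetical, not part of any Common Voice tooling.

```python
import re

# Patterns inferred from file names quoted in this thread (assumptions,
# not an official specification of Common Voice naming).
NAME_SCHEMES = {
    "pre-v1": re.compile(r"sample-\d+\.mp3"),          # e.g. sample-000000.mp3
    "v1": re.compile(r"[0-9a-f]{128}\.mp3"),           # 128 hex characters
    "later": re.compile(r"common_voice_en_\d+\.mp3"),  # e.g. v5.1 / v6.1
}


def naming_scheme(file_name: str) -> str:
    """Return the label of the first pattern that matches the whole name."""
    for label, pattern in NAME_SCHEMES.items():
        if pattern.fullmatch(file_name):
            return label
    return "unknown"
```

Because `fullmatch` is used, a name only counts if the entire string fits the pattern, so a path with a directory prefix would come back as "unknown".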

ftyers · June 15, 2021, 3:34pm

In principle you should be able to take the first 100 lines of the recent release. I imagine that they are added in chronological order.
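A rough sketch of what "take the first 100 lines" could look like, assuming the split file is tab-separated with a header row and that the clip file name sits in a column called `path` (both are assumptions here, so check the header of your copy). The helper name `first_clips` is made up for illustration.

```python
import csv
from itertools import islice


def first_clips(tsv_lines, count=100, column="path"):
    """Return the clip names from the first `count` data rows.

    `tsv_lines` is any iterable of text lines, such as an open file.
    The column name "path" is an assumption about the TSV header.
    """
    # QUOTE_NONE keeps quote characters inside sentences from being
    # interpreted as CSV quoting.
    reader = csv.DictReader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [row[column] for row in islice(reader, count)]


# Usage (the file name is a placeholder):
# with open("test.tsv", newline="", encoding="utf-8") as f:
#     print(first_clips(f))
```

Whether these rows correspond to the instances used in the paper depends on the rows being in a stable order across releases, which this sketch cannot verify.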

makoto_wada_jp · June 19, 2021, 2:04am

@ftyers I learned from the Common Voice README that the contents of these files change between releases:

Each test/train/dev set is generated non-deterministically, meaning that they will vary from release to release even for minor updates. This is to avoid reproducing and perpetuating any demographic skews in each subsequent set.

So I compared validated.tsv in Common Voice Corpus 5.1 and Common Voice Corpus 6.1. Unfortunately, the contents do seem to change. I presume this is because the upvotes and downvotes for each audio file are tallied continuously. In other words, a clip in validated.tsv might later end up in invalidated.tsv, but I have not checked enough to know. Going back even further, the file names also seem to change between pre-v1 (sample-#+.mp3), v1 (128 characters + .mp3), and v6 (common_voice_en_#+.mp3), which makes the releases very hard to compare.

audioFileName	v5.1 validated.tsv	v5.1 *.tsv	v6.1 validated.tsv	v6.1 *.tsv
common_voice_en_22215682.mp3	line 12	test.tsv line 12	line 222,618	train.tsv line 72,826
common_voice_en_23730890.mp3	DNE	DNE	line 18	test.tsv line 18
common_voice_en_22214552.mp3	line 35	test.tsv line 35	line 60,972	DNE
common_voice_en_19698345.mp3	line 41	test.tsv line 40	line 13,479	test.tsv line 9,735
common_voice_en_19740330.mp3	line 47	test.tsv line 46	line 3,730	test.tsv line 3,158
...

DNE = Does Not Exist
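A comparison like the one in the table can be sketched as follows. This is only an illustration: it assumes tab-separated files with a header row (counted as line 1) and a `path` column, and the names `line_numbers` and `locate` are hypothetical.

```python
import csv


def line_numbers(tsv_lines, column="path"):
    """Map each clip name to its line number, counting the header as line 1."""
    reader = csv.DictReader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Data rows start on line 2 because the header occupies line 1.
    return {row[column]: number for number, row in enumerate(reader, start=2)}


def locate(clip, indexes):
    """Look a clip up in several indexes; "DNE" marks a missing clip.

    `indexes` maps a label such as "v5.1 validated.tsv" to the dict
    returned by line_numbers().
    """
    return {label: index.get(clip, "DNE") for label, index in indexes.items()}
```

Building one index per file and then calling `locate` for each clip of interest avoids rescanning the large validated.tsv files for every lookup.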

ftyers · June 19, 2021, 3:08am

That’s unfortunate. I suspect that the original release is not available without a substantial amount of effort (but @phirework can confirm). We had a similar issue with the Common Voice paper.

makoto_wada_jp · June 21, 2021, 8:56am

@ftyers I didn’t realize that you were an author of the paper. That is nice.

Regarding my earlier comment above:

I presume this is because the upvotes and downvotes for each audio file are tallied continuously. In other words, a clip in validated.tsv might later end up in invalidated.tsv, but I have not checked enough to know.

I compared Common Voice Corpus 5.1 (validated.tsv) against Common Voice Corpus 6.1 (invalidated.tsv and other.tsv) for English. Assuming that the dataset creators did not change the audio file names, here are the stats:

Clips in v5.1\validated.tsv that moved	Moved to v6.1\invalidated.tsv	Moved to v6.1\other.tsv
549	0	549
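Counts of this kind can be obtained with set intersections on the clip names. The sketch below assumes tab-separated files with a header row and a `path` column, and that file names are stable between releases (the same assumption as above). The helper names `clip_names` and `count_moved` are illustrative only.

```python
import csv


def clip_names(tsv_lines, column="path"):
    """Collect the set of clip names listed in one TSV file."""
    reader = csv.DictReader(tsv_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[column] for row in reader}


def count_moved(old_validated, new_invalidated, new_other):
    """Count clips validated in the old release that now sit elsewhere."""
    to_invalidated = old_validated & new_invalidated
    to_other = old_validated & new_other
    return {
        "invalidated.tsv": len(to_invalidated),
        "other.tsv": len(to_other),
        "total": len(to_invalidated | to_other),
    }
```

Keeping the intersections as sets, rather than only their sizes, would also make it easy to list the individual clips that moved.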

For reference, the Common Voice README defines the three files as follows:

validated contains a list of all clips that have received two or more validations where up_votes > down_votes
invalidated contains a list of all clips that have received two or more validations where down_votes > up_votes, or clips that have received three or more validations where down_votes = up_votes
other contains a list of all clips that have not received sufficient validations to determine their status
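One way to read the three rules quoted above, written out as a small function. This is my interpretation rather than the actual Common Voice code: it assumes "validations" means the total number of votes cast (up_votes + down_votes), and the function name `bucket` is made up.

```python
def bucket(up_votes: int, down_votes: int) -> str:
    """Return which list a clip would fall into under the quoted rules."""
    # Assumption: "validations" = total number of votes on the clip.
    validations = up_votes + down_votes
    if validations >= 2 and up_votes > down_votes:
        return "validated"
    if validations >= 2 and down_votes > up_votes:
        return "invalidated"
    if validations >= 3 and down_votes == up_votes:
        return "invalidated"
    # Not enough votes yet to decide either way.
    return "other"
```

Under this reading, a clip whose up_votes count drops from 2 to 1 (with no down votes) falls back from validated to other, since it no longer has two validations.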

Furthermore, here are a few examples of clips that moved from v5.1\validated.tsv to v6.1\other.tsv.

AudioFileName	v5.1\validated.tsv	v6.1\other.tsv	Changes
common_voice_en_12136946.mp3	line 964,803	line 200	up_votes: 2 -> 1
common_voice_en_12308591.mp3	line 964,806	line 201	up_votes: 2 -> 1
common_voice_en_12309701.mp3	line 964,810	line 202	up_votes: 2 -> 1
common_voice_en_1324731.mp3	line 18,620	line 192	up_votes: 2 -> 1
...
