Where is the documentation for the Mozilla Common Voice database?

makoto_wada_jp · June 10, 2021, 7:21am

I cannot find any documentation about Mozilla Common Voice except for: Common Voice > How does it work?
So here are the things I would like to know about the database.

Where can I find documentation and the revision history of the database (v1, v2, … v5.1, v6.1)?
What does each *.tsv file represent, especially the following:
- invalidated.tsv
- other.tsv
- reported.tsv
- validated.tsv
  and which of the tsv files represents the Clip Graveyard mentioned in Common Voice > How does it work?
What does each header represent in the *.tsv files, especially “segment” and “reason”?
I presume “up_votes” and “down_votes” are related to “>= 2 Yes votes” and “>= No votes” mentioned in Common Voice > How does it work?
The header for reported.tsv is different from the rest, but I guess I will know why once I know what each file represents.

HEADER (Y = exists, N = does not exist)

file	client_id	path	sentence	sentence_id	up_votes	down_votes	age	gender	accent	locale	segment	reason
the rest of *.tsv	Y	Y	Y	N	Y	Y	Y	Y	Y	Y	Y	N
reported.tsv	N	N	Y	Y	N	N	N	N	N	Y	N	Y
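
A presence table like this can be generated rather than compiled by hand. The following is a minimal Python sketch, assuming each file starts with a tab-separated header line. The two header lines in `SAMPLE_HEADERS` are placeholders inferred from the example row and replies later in this thread, not copied from a real release, and the helper names `read_header` and `header_matrix` are made up for illustration.

```python
import csv
import io


def read_header(stream):
    """Return the column names from the first line of a tab-separated stream."""
    return next(csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE))


def header_matrix(headers_by_file):
    """Build a Y/N presence table: one row per file, one column per header name."""
    columns = []
    for header in headers_by_file.values():
        for name in header:
            if name not in columns:
                columns.append(name)
    rows = {
        file_name: ["Y" if name in header else "N" for name in columns]
        for file_name, header in headers_by_file.items()
    }
    return columns, rows


# Placeholder header lines (assumed, not taken from a real download).
SAMPLE_HEADERS = {
    "validated.tsv": "client_id\tpath\tsentence\tup_votes\tdown_votes"
                     "\tage\tgender\taccent\tlocale\tsegment\n",
    "reported.tsv": "sentence\tsentence_id\tlocale\treason\n",
}

headers = {
    name: read_header(io.StringIO(text)) for name, text in SAMPLE_HEADERS.items()
}
columns, rows = header_matrix(headers)

print("\t".join(["file"] + columns))
for file_name, flags in rows.items():
    print("\t".join([file_name] + flags))
```

With a real download, the `io.StringIO(text)` call would be replaced by an opened file, for example `open(path, newline="", encoding="utf-8")`.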

Is each database download incremental or a full version?
In other words,
- a) Can I just download v6.1 and it will contain everything from v1 … v5.1?
  or
- b) Do I need to download v1, v2, … v5.1, v6.1 to construct the full database?

Joshua_G · June 12, 2021, 6:30am

For a privacy browser, a little transparency would be appreciated. You lose your privacy “title” and you’ll be using Chrome on this message board.

ftyers · June 12, 2021, 6:03pm

I don’t think there is any.
invalidated = has been downvoted, other = excluded, reported = has been reported by a user, validated = has been upvoted enough.
segment = whether the clip is part of a particular subcorpus (e.g. the target segments subcorpus), reason = why the clip/transcript was rejected.
The reported.tsv header is different because it includes the reason the clip was reported.
Each release is a full dump, not incremental. You only need to download the latest release.
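
Since a single release is enough, a quick way to get oriented is to count the rows in each file of that one release. This is a rough Python sketch under stated assumptions: the files are tab-separated with one header line, the descriptions in `FILE_MEANING` simply restate the answer above, and `count_rows` and `summarize_release` are hypothetical helper names. The directory layout of a real download may differ.

```python
import csv
import io
from pathlib import Path

# Descriptions restate the reply above; they are not official definitions.
FILE_MEANING = {
    "validated.tsv": "has been upvoted enough",
    "invalidated.tsv": "has been downvoted",
    "other.tsv": "excluded",
    "reported.tsv": "has been reported by a user",
}


def count_rows(stream):
    """Count non-empty data rows in a tab-separated stream, skipping the header."""
    # QUOTE_NONE keeps quotation marks inside sentences from being parsed as CSV quotes.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header line
    return sum(1 for row in reader if row)


def summarize_release(release_dir):
    """Map each known file found in release_dir to (row count, description)."""
    summary = {}
    for name, meaning in FILE_MEANING.items():
        path = Path(release_dir) / name
        if path.is_file():
            with open(path, newline="", encoding="utf-8") as handle:
                summary[name] = (count_rows(handle), meaning)
    return summary


# In-memory demonstration with made-up rows.
demo = io.StringIO("sentence\tsentence_id\tlocale\treason\nhello\tabc\ten\tother\n")
demo_count = count_rows(demo)
print(demo_count)
```

Calling `summarize_release` on the folder that holds the extracted *.tsv files of the latest release returns only the files that are actually present.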

Hope this helps!

Fran

makoto_wada_jp · June 14, 2021, 7:57am

@ftyers Thank you for your answer. Yes, it is very helpful. I do have an additional question:

Regarding other.tsv = excluded, was a clip excluded because it doesn’t have enough upvotes or downvotes to be in either validated.tsv or invalidated.tsv, respectively?

For segment, I see that in every *.tsv other than reported.tsv the value is either empty “” (2,131,602 cases) or “Benchmark” (49,384 cases: validated.tsv [32,726 cases] + invalidated.tsv [4,071 cases] + other.tsv [12,573 cases] + test.tsv [14 cases]) for Common Voice Corpus 6.1 English.
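
Counts like these can be reproduced with a few lines of Python. The sketch below assumes a tab-separated file with a header line that includes a `segment` column. The rows in `SAMPLE` are invented for illustration, and `segment_counts` is a hypothetical helper name.

```python
import csv
import io
from collections import Counter


def segment_counts(stream):
    """Tally the values of the 'segment' column; empty or missing values count as ''."""
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        counts[row.get("segment") or ""] += 1
    return counts


# Invented rows that follow the assumed column order of the clip files.
SAMPLE = (
    "client_id\tpath\tsentence\tup_votes\tdown_votes\tage\tgender\taccent\tlocale\tsegment\n"
    "id1\tclip_1.mp3\tseven\t0\t0\t\t\t\ten\tBenchmark\n"
    "id2\tclip_2.mp3\thello there\t2\t0\t\t\t\ten\t\n"
    "id3\tclip_3.mp3\tyes\t2\t0\t\t\t\ten\tBenchmark\n"
)

counts = segment_counts(io.StringIO(SAMPLE))
print(dict(counts))
```

Running the same function over each file of a release and adding the counters together gives the per-file and total figures.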

So if you happen to know what “Benchmark” is, please let me know. I am a little confused by the word “Benchmark” when the database already has train.tsv, dev.tsv, and test.tsv. In other words, test.tsv sounds like it should be the “Benchmark”.

makoto_wada_jp · June 14, 2021, 8:09am

@Joshua_G Thank you for your comments. I am relatively new to Mozilla Discourse, so please excuse me if I sound unsavvy with the following questions:

What do you mean by privacy browser?
What is the transparency that is required?

For the above, I am using the Chrome browser. However, I do not understand what you mean by “title” in the above passage.

ftyers · June 14, 2021, 1:37pm

Yes, Benchmark is the fixed-vocab subset (0…9 and yes/no), e.g.

other.tsv:e0230fbbdc872252b94443dc3fa5d50c0c8845ba44f6485ff57a2b7b0cf5db94cd1b69c7ca10634e630917815aaa4e6787a034eb82f6c1dfe406d5a2930078ae	common_voice_en_22430937.mp3	seven	0	0	fourties	male		en	Benchmark
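
To see the fixed vocabulary for yourself, the Benchmark rows can be filtered out and their distinct sentences listed. This is only a sketch: it assumes the column order shown in the example row above (client_id, path, sentence, up_votes, down_votes, age, gender, accent, locale, segment), uses invented rows, and `benchmark_vocabulary` is a hypothetical helper name.

```python
import csv
import io


def benchmark_vocabulary(stream, segment="Benchmark"):
    """Return the sorted distinct sentences of rows whose segment matches."""
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return sorted(
        {row["sentence"] for row in reader if row.get("segment") == segment}
    )


# Invented rows; the accent field is left empty, as in the example row above.
SAMPLE = (
    "client_id\tpath\tsentence\tup_votes\tdown_votes\tage\tgender\taccent\tlocale\tsegment\n"
    "id1\tclip_1.mp3\tseven\t0\t0\tfourties\tmale\t\ten\tBenchmark\n"
    "id2\tclip_2.mp3\tThe weather is nice today.\t2\t0\t\t\t\ten\t\n"
    "id3\tclip_3.mp3\tyes\t2\t0\t\t\t\ten\tBenchmark\n"
    "id4\tclip_4.mp3\tseven\t1\t1\t\t\t\ten\tBenchmark\n"
)

vocabulary = benchmark_vocabulary(io.StringIO(SAMPLE))
print(vocabulary)
```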

makoto_wada_jp · June 15, 2021, 10:33am

@ftyers Thank you for all your comments thus far. For your information, when I was searching for other information regarding Common Voice Dataset … I found:

I just wanted to share the information with you.

heyhillary · November 16, 2021, 3:23pm

Hey @makoto_wada_jp

There is also the Common Voice metadata available here: https://github.com/common-voice/cv-dataset

makoto_wada_jp · January 24, 2022, 8:04am

@heyhillary Yes. Thank you for pointing it out. The link you’ve provided (https://github.com/common-voice/cv-dataset) is the same as the link for README.md right above your reply.

Sorry for the confusing way of linking, but I much appreciate your response.

heyhillary · January 24, 2022, 11:12am

Hey @makoto_wada_jp,

No worries, I didn’t realise that it was the same link.

Just to note, we have recently updated the Common Voice About page to make the documentation regarding the dataset more visible.