Where is the documentation for the Mozilla Common Voice database?

I cannot find any documentation on Mozilla Common Voice except for: Common Voice > How does it work?.
So here are the things I would like to know about the database.

  1. Where can I find documentation and a revision history of the database (v1, v2, … v5.1, v6.1)?
  2. What does each *.tsv file represent, especially the following:
  3. What does each header represent in the *.tsv files, especially “segment” and “reason”?
    I presume “up_votes” and “down_votes” are related to the “>= 2 Yes votes” and “>= No votes” mentioned in Common Voice > How does it work?.
  4. The header of reported.tsv is different from the rest, but I guess I will understand why once I know what each file represents.

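HEADER (Y = Exists, N = Does Not Exist)

| header      | reported.tsv | the rest of *.tsv |
|-------------|--------------|-------------------|
| client_id   | N            | Y                 |
| path        | N            | Y                 |
| sentence    | Y            | Y                 |
| sentence_id | Y            | N                 |
| up_votes    | N            | Y                 |
| down_votes  | N            | Y                 |
| age         | N            | Y                 |
| gender      | N            | Y                 |
| accent      | N            | Y                 |
| locale      | Y            | Y                 |
| segment     | N            | Y                 |
| reason      | Y            | N                 |
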
  5. Is each database download incremental or a full version?
    In other words,
    • a) Can I just download v6.1, and it will contain everything from v1 … v5.1?
          or
    • b) Do I need to download v1, v2, … v5.1, v6.1 to construct the full database?

For a privacy browser, a little transparency would be appreciated. You lose your privacy “title” and you’ll be using chrome on this message board.

  1. I don’t think there is any.
  2. invalidated = has been downvoted, other = excluded, reported = has been reported by a user, validated = has been upvoted enough
  3. segment = whether the clip is part of a particular subcorpus (e.g. the target segments subcorpus), reason = why the clip/transcript was rejected
  4. It’s different because it carries the reason the clip was reported.
  5. Each release is a full dump, not incremental. You only need to download the latest release.
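
For points 3 and 4, one quick way to see which columns each file carries is to read only the header row of every *.tsv file. This is a minimal sketch, assuming a release has been unpacked into a single directory of tab-separated files; the function names and the directory path are illustrative and not part of any Common Voice tooling.

```python
import csv
from pathlib import Path


def read_header(tsv_path):
    """Return the column names from the first row of a tab-separated file."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return next(reader, [])  # empty list if the file has no rows


def header_presence(dataset_dir):
    """Map each *.tsv file name in dataset_dir to its list of header columns."""
    return {
        path.name: read_header(path)
        for path in sorted(Path(dataset_dir).glob("*.tsv"))
    }


def presence_table(headers):
    """Build a Y/N lookup: column name -> {file name -> 'Y' or 'N'}."""
    all_columns = []
    for columns in headers.values():
        for column in columns:
            if column not in all_columns:
                all_columns.append(column)
    return {
        column: {
            name: ("Y" if column in columns else "N")
            for name, columns in headers.items()
        }
        for column in all_columns
    }
```

Calling `presence_table(header_presence("path/to/unpacked/release"))` (the path is a placeholder) gives a Y/N lookup per column and file, in the same spirit as the table in the question.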

Hope this helps!

Fran


@ftyers Thank you for your answer. Yes, it is very helpful. I do have an additional question:

Regarding other.tsv = excluded, was a clip excluded because it doesn’t have enough upvotes or downvotes to be in either validated.tsv or invalidated.tsv, respectively?

For segment, I see that in every *.tsv file other than reported.tsv the value is either empty “” (2,131,602 cases) or “Benchmark” (49,384 cases: validated.tsv [32,726 cases] + invalidated.tsv [4,071 cases] + other.tsv [12,573 cases] + test.tsv [14 cases]) for Common Voice Corpus 6.1 English.
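
Counts like these can be reproduced by tallying the “segment” column file by file. The following is only a sketch, assuming each file is tab-separated with a header row; the function names are made up for illustration, and reported.tsv is skipped by default.

```python
import csv
from collections import Counter
from pathlib import Path


def count_segments(tsv_path):
    """Count the values of the 'segment' column in one TSV file.

    Empty cells are counted under the empty string ''. A file without a
    'segment' column yields an empty Counter.
    """
    counts = Counter()
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        if not reader.fieldnames or "segment" not in reader.fieldnames:
            return counts
        for row in reader:
            counts[row.get("segment") or ""] += 1
    return counts


def count_segments_per_file(dataset_dir, skip=("reported.tsv",)):
    """Return {file name: Counter of segment values} for every *.tsv file."""
    return {
        path.name: count_segments(path)
        for path in sorted(Path(dataset_dir).glob("*.tsv"))
        if path.name not in skip
    }
```

Each returned Counter can then be inspected for the keys `""` and `"Benchmark"`.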

So if you happen to know what the “Benchmark” is, then please let me know. I am a little confused with the word “Benchmark” when the database has train.tsv, dev.tsv, and test.tsv. In other words, test.tsv sounds like the “Benchmark”.

@Joshua_G Thank you for your comments. I am relatively new to Mozilla Discourse, so please excuse me if I sound unsavvy with the following questions:

  1. What do you mean by privacy browser?
  2. What is the transparency that is required?
  3. As for the above, I am using the Chrome browser. However, I do not understand what you mean by “title” in that passage.

Yes, Benchmark is the fixed-vocab subset (0…9 and yes/no), e.g.

other.tsv:e0230fbbdc872252b94443dc3fa5d50c0c8845ba44f6485ff57a2b7b0cf5db94cd1b69c7ca10634e630917815aaa4e6787a034eb82f6c1dfe406d5a2930078ae	common_voice_en_22430937.mp3	seven	0	0	fourties	male		en	Benchmark
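
To read a row like the one above, the tab-separated fields can be paired with column names. The sketch below assumes the column order client_id, path, sentence, up_votes, down_votes, age, gender, accent, locale, segment, taken from the header list earlier in this thread; the function names are illustrative only, and the `other.tsv:` prefix is treated as a file-name prefix from the search output rather than as part of the data.

```python
# Assumed column order for the non-reported *.tsv files in this release.
COLUMNS = [
    "client_id", "path", "sentence", "up_votes", "down_votes",
    "age", "gender", "accent", "locale", "segment",
]


def parse_clip_row(line):
    """Turn one tab-separated clip row into a dict keyed by column name."""
    line = line.rstrip("\r\n")
    # Drop a leading "<file>.tsv:" prefix if the line came from a search tool.
    prefix, sep, rest = line.partition(":")
    if sep and prefix.endswith(".tsv"):
        line = rest
    fields = line.split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    row = dict(zip(COLUMNS, fields))
    row["up_votes"] = int(row["up_votes"])
    row["down_votes"] = int(row["down_votes"])
    return row


def is_benchmark(row):
    """True when the clip belongs to the 'Benchmark' segment."""
    return row["segment"] == "Benchmark"
```

For the example row, the empty field between “male” and “en” would land in `accent`, and `is_benchmark` would return True.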

@ftyers Thank you for all your comments thus far. For your information, when I was searching for other information regarding Common Voice Dataset … I found:

I just wanted to share the information with you.


Hey @makoto_wada_jp

There is also the Common Voice metadata available here: https://github.com/common-voice/cv-dataset


@heyhillary Yes. Thank you for pointing it out. The link you’ve provided (https://github.com/common-voice/cv-dataset) is the same as the link for README.md right above your reply.

Sorry for the confusing way of linking, but I very much appreciate your response.

Hey @makoto_wada_jp,

No worries, I didn’t realise that it was the same link.

Just to note, we have recently updated the Common Voice about page to make the documentation regarding the dataset more visible.