I cannot find any documentation on Mozilla Common Voice except for Common Voice > How does it work?
So here are the things I would like to know about the database.
Where can I find documentation and a revision history for the database (v1, v2, … v5.1, v6.1)?
What does each *.tsv file represent especially the following:
What does each header represent in the *.tsv files, especially “segment” and “reason”?
I presume, “up_votes” and “down_votes” are related to “>= 2 Yes votes” and “>= No votes” mentioned in Common Voice > How does it work?.
The header for reported.tsv is different from the rest, but I guess I will understand why once I know what each file represents.
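HEADER (Y = Exists, N = Does Not Exist)

| file | client_id | path | sentence | sentence_id | up_votes | down_votes | age | gender | accent | locale | segment | reason |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| reported.tsv | N | N | Y | Y | N | N | N | N | N | Y | N | Y |
| the rest of *.tsv | Y | Y | Y | N | Y | Y | Y | Y | Y | Y | Y | N |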
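
For anyone who wants to rebuild the Y/N header comparison above from their own download, here is a minimal sketch. The function names `read_header` and `header_presence` are my own (hypothetical), and it only assumes that each *.tsv file starts with a tab-separated header row.

```python
import csv


def read_header(lines):
    """Return the column names from the first row of tab-separated text.

    `lines` is any iterable of strings, e.g. a file opened with newline="".
    Returns an empty list if there is no first row.
    """
    return next(csv.reader(lines, delimiter="\t"), [])


def header_presence(headers):
    """Turn {file_name: [column, ...]} into {file_name: {column: "Y" or "N"}}.

    The column set is the union of all headers, in first-seen order.
    """
    all_columns = []
    for cols in headers.values():
        for col in cols:
            if col not in all_columns:
                all_columns.append(col)
    return {
        name: {col: ("Y" if col in cols else "N") for col in all_columns}
        for name, cols in headers.items()
    }
```

To use it on a real download, open each *.tsv in the language folder with `open(path, newline="", encoding="utf-8")`, pass the file object to `read_header`, collect the results in a dict keyed by file name, and hand that dict to `header_presence`.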
Is each database download incremental or a full version?
In other words,
a) Can I just download v6.1 and it will contain everything from v1 … v5.1?
or
b) Do I need to download v1, v2, … v5.1, v6.1 to construct the full database?
@ftyers Thank you for your answer. Yes, it is very helpful. I do have an additional question:
Regarding other.tsv = excluded, was it excluded because it doesn’t have enough upvotes or downvotes to be in either validated.tsv or invalidated.tsv, respectively?
For segment, I see that in every *.tsv other than reported.tsv the value is either empty “” (2,131,602 cases) or “Benchmark” (49,384 cases: validated.tsv [32,726 cases] + invalidated.tsv [4,071 cases] + other.tsv [12,573 cases] + test.tsv [14 cases]) for Common Voice Corpus 6.1 English.
So if you happen to know what the “Benchmark” is, then please let me know. I am a little confused with the word “Benchmark” when the database has train.tsv, dev.tsv, and test.tsv. In other words, test.tsv sounds like the “Benchmark”.
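
In case anyone wants to reproduce the segment counts above, this is roughly how such a tally can be done. It is only a sketch: `count_segments` is a name I made up, and it assumes the file has a header row containing a “segment” column.

```python
import csv
from collections import Counter


def count_segments(lines):
    """Count the values of the "segment" column in tab-separated text.

    `lines` is any iterable of strings, e.g. a file opened with newline="".
    Empty or missing values are counted under the empty string "".
    QUOTE_NONE keeps quote characters inside sentences from confusing the parser.
    """
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row.get("segment") or "" for row in reader)
```

For example, `count_segments(open("validated.tsv", newline="", encoding="utf-8"))` returns a `Counter` whose keys are the distinct segment values found in that file.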
Yes, Benchmark is the fixed-vocab subset (0…9 and yes/no), e.g.
other.tsv: e0230fbbdc872252b94443dc3fa5d50c0c8845ba44f6485ff57a2b7b0cf5db94cd1b69c7ca10634e630917815aaa4e6787a034eb82f6c1dfe406d5a2930078ae common_voice_en_22430937.mp3 seven 0 0 fourties male en Benchmark
@ftyers Thank you for all your comments thus far. For your information, when I was searching for other information regarding Common Voice Dataset … I found: