I find the Common Voice documentation inadequate (incomplete).
- I wish the Common Voice > ABOUT had a link to:
- I also wish each dataset (e.g. English) had the following information:
- NUMBER OF AUDIO FILES (in clip/)
- If NUMBER OF VOICES means number of speakers, then NUMBER OF VOICES should be translated to 話者数 (NUMBER OF SPEAKERS) and not 音声数 (NUMBER OF AUDIO) in the Japanese site:
- Presuming Common Voice Dataset (GitHub) is maintained by Mozilla
- cv-dataset/CHANGELOG.md has not been updated. It doesn’t reflect Common Voice Corpus 10.0.
- Regarding the Fields section
- It is missing a description of the locale field. I also feel the field name should be language, not locale, since it uses values such as ja and yue, which are language codes from the ISO 639 standards (ja is an ISO 639-1 code, yue an ISO 639-3 code) rather than locales.
- It doesn’t describe the fields (sentence_id and reason) that are only present in reported.tsv. (Not that it is that important, but reported.tsv doesn’t have an empty line at the end like the other *.tsv files.)
- There is a misspelling and a formatting error in the demographic spec: it has an entry southatlandtic, which should be south_atlantic (adhering to the other entries there, which follow the snake case convention).
- What is the relationship between the .tsv files? validated.tsv (Count = 1,589,008) != train.tsv + dev.tsv + test.tsv (Count = 954,094). I believe the relationship between clip/ and the *.tsv files should be documented, for example as follows, which seems to be true:
- The audio in the clip/ folder seems to be validated.tsv + invalidated.tsv + other.tsv
- validated.tsv & invalidated.tsv & other.tsv are mutually exclusive
File | Count ※ |
---|---|
dev.tsv | 16,345 |
invalidated.tsv | 248,337 |
other.tsv | 293,021 |
reported.tsv | 4,169 |
test.tsv | 16,345 |
train.tsv | 921,404 |
validated.tsv | 1,589,008 |
※ Counts don’t include the header line.
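
For anyone who wants to check the relationships above on their own download, here is a minimal sketch in Python. It rests on some assumptions that I have not confirmed against any official spec: each *.tsv has a header row with a `path` column holding the audio file name, the audio folder is called `clips` (adjust `clips_dirname` if yours is named differently), and `check_partition` is just a hypothetical helper name of mine. It counts the data rows per file, counts pairwise overlaps between validated.tsv, invalidated.tsv and other.tsv, and compares their union with the files on disk.

```python
import csv
from pathlib import Path


def read_clip_paths(tsv_file):
    """Return the `path` value of every data row (the header is not counted)."""
    with open(tsv_file, newline="", encoding="utf-8") as f:
        # QUOTE_NONE: sentences may contain quote characters, so treat them literally.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [row["path"] for row in reader]


def check_partition(dataset_dir, clips_dirname="clips"):
    """Compare validated/invalidated/other .tsv files with the audio folder.

    Assumption: `clips_dirname` is the audio folder inside `dataset_dir`.
    """
    root = Path(dataset_dir)
    names = ["validated.tsv", "invalidated.tsv", "other.tsv"]
    paths = {name: read_clip_paths(root / name) for name in names}
    sets = {name: set(values) for name, values in paths.items()}

    # Row counts per file, excluding the header line.
    counts = {name: len(values) for name, values in paths.items()}

    # Number of clips shared by each pair of files (0 means mutually exclusive).
    overlaps = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            overlaps[(a, b)] = len(sets[a] & sets[b])

    # Compare the union of the three files with what is actually on disk.
    listed = set().union(*sets.values())
    on_disk = {p.name for p in (root / clips_dirname).iterdir() if p.is_file()}

    return {
        "counts": counts,
        "overlaps": overlaps,
        "listed_but_missing": len(listed - on_disk),
        "on_disk_but_unlisted": len(on_disk - listed),
    }


if __name__ == "__main__":
    import sys

    # Usage: python check_partition.py /path/to/extracted/dataset/en
    if len(sys.argv) > 1:
        for key, value in check_partition(sys.argv[1]).items():
            print(key, value)
```

If the relationships I described hold, every overlap should be 0 and both `listed_but_missing` and `on_disk_but_unlisted` should be 0.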