Inadequate Documentation

makoto_wada_jp · September 15, 2022, 12:25am

I find the Common Voice documentation inadequate (incomplete).

I wish the Common Voice > ABOUT had a link to:
- Common Voice Dataset (GitHub)
I also wish each dataset (e.g. English) had the following information:
1. NUMBER OF AUDIO FILES (in clip/)
If NUMBER OF VOICES means number of speakers, then NUMBER OF VOICES should be translated to 話者数 (NUMBER OF SPEAKERS) and not 音声数 (NUMBER OF AUDIO) in the Japanese site:
Presuming Common Voice Dataset (GitHub) is maintained by Mozilla
1. cv-dataset/CHANGELOG.md is not updated. It doesn’t reflect Common Voice Corpus 10.0.
2. Regarding the Fields section
  1. It is missing a description of the locale field. I also feel the field name should be language, not locale, since it uses values such as ja and yue taken from the ISO 639 language code standards
  2. It doesn’t describe the fields (sentence_id and reason) that are only present in reported.tsv.
    (not that it is that important … reported.tsv doesn’t have an empty line at the end like other *.tsv files)
  3. There is a misspelling and formatting error in the demographic spec: it has the entry southatlandtic, which should be south_atlantic (adhering to the other entries there, which follow the snake case convention)
3. What is the relationship between the .tsv files? validated.tsv (Count = 1,589,008) != train.tsv + dev.tsv + test.tsv (Count = 954,094). I believe the relationship between clip/ and the *.tsv files should be documented, like the following, which seems to be true.
  - audio in clip/ folder seems to be validated.tsv + invalidated.tsv + other.tsv
  - validated.tsv & invalidated.tsv & other.tsv are mutually exclusive

File	Count ※
dev.tsv	16,345
invalidated.tsv	248,337
other.tsv	293,021
reported.tsv	4,169
test.tsv	16,345
train.tsv	921,404
validated.tsv	1,589,008

※ doesn’t include the header line in the count
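For anyone who wants to reproduce these figures, here is a minimal Python sketch. It assumes each file is tab-separated with a single header line; the helper name count_rows is made up for illustration, and the numbers in counts are simply copied from the table above.

```python
import csv

def count_rows(tsv_path):
    """Count data rows in a TSV file, excluding the header line."""
    with open(tsv_path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE: sentences may contain stray quote characters.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        next(reader, None)  # skip the header line
        return sum(1 for row in reader if row)  # ignore blank lines

# Counts copied from the table above (header lines excluded).
counts = {
    "dev.tsv": 16_345,
    "invalidated.tsv": 248_337,
    "other.tsv": 293_021,
    "test.tsv": 16_345,
    "train.tsv": 921_404,
    "validated.tsv": 1_589_008,
}

# train + dev + test is much smaller than validated.
split_total = counts["train.tsv"] + counts["dev.tsv"] + counts["test.tsv"]

# If the three categories are mutually exclusive and cover clip/,
# this would be the number of audio files in that folder.
clip_total_guess = (
    counts["validated.tsv"] + counts["invalidated.tsv"] + counts["other.tsv"]
)

print(split_total, clip_total_guess)
```

Calling count_rows("validated.tsv") on an extracted dataset should give the per-file figure; the last sum is only a guess at the clip/ total until it is checked against the folder itself.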

bozden · September 16, 2022, 12:04am

Hi @makoto_wada_jp, I hope I can give some pointers.

The GitHub link exists in the footer of every page… But why not? There could be an entry “how can I contribute to code?”… But I think that page is mainly directed at newcomers who are thinking of contributing their voices; that might be the reason.

Yes, those details are in the metadata repo…

The pages are translated by volunteers; you may like to join Pontoon Common Voice Japanese and suggest the correction. Also, the translations are at 62%, so your community will welcome your contribution.

Perhaps open an issue on the repo or make a PR for these?

Some info about the clips & splits:

Under the clips directory, there are ALL clips recorded, even rejected ones.
All clips are divided into three categories: validated, invalidated and other. Other includes those not yet validated or invalidated.
The train/dev/test splits are generated from validated with the code in the CorporaCreator repo. It does NOT take the whole dataset and divide it randomly into e.g. 80-10-10% as you might have seen in other projects. There are some rules in effect that limit the training set:
- The sample sizes are calculated via a statistical confidence algorithm
- No sentence can be in more than one set (to prevent sentence bias)
- No voice (identified by client_id) can be in more than one set (to prevent testing with same voice you trained)
- By default, one recording from one voice is included (to prevent voice bias).
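To give a feel for what a “statistical confidence” sample size looks like, here is a minimal sketch of a standard sample-size formula for a proportion with a finite-population adjustment. The 99% confidence level (z = 2.58), the 1% margin of error and p = 0.5 are assumptions chosen for illustration; nothing in this thread confirms that CorporaCreator uses exactly these values or this formula.

```python
def sample_size(population, z=2.58, margin=0.01, p=0.5):
    """Sample size for estimating a proportion in a finite population.

    z, margin and p are illustrative assumptions, not confirmed settings.
    """
    # Size needed for an infinitely large population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Shrink it for a finite population.
    return n0 / (1 + n0 / population)

print(int(sample_size(921_404)))
```

With these assumed constants, a population of 921,404 (the train.tsv count posted above) gives about 16,345, which lines up with the dev.tsv and test.tsv counts in the same table. That suggests a formula of this kind is in play, but treat it as a hint rather than proof.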

This decision was made at the start of the project to produce the default splits, to ensure diversity and scientific correctness, so that the trained model will perform better in the real world. If you don’t use such restrictions, you may get better results during tests, but the model will not perform as well in real applications.

But, one can use CorporaCreator with “-s N” option to include more recordings per person to get more into the training set. Also you have the validated.tsv file, you can implement your own splitting algorithm if you want… Actually, we do so to get better results in low-resourced languages, on paper at least
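As a rough illustration of a do-it-yourself split that still respects the rules above, here is a minimal sketch. It is not the CorporaCreator algorithm: the function name split_by_voice and its parameters are hypothetical, and it assumes each row from validated.tsv has been loaded as a dict with client_id and sentence keys.

```python
from collections import defaultdict

def split_by_voice(rows, max_per_voice=1, dev_voices=1, test_voices=1):
    """Assign whole voices to test, dev, then train (illustrative only)."""
    # Keep at most max_per_voice recordings for each client_id.
    by_voice = defaultdict(list)
    for row in rows:
        kept = by_voice[row["client_id"]]
        if len(kept) < max_per_voice:
            kept.append(row)

    splits = {"test": [], "dev": [], "train": []}
    sentence_owner = {}  # sentence -> name of the set that used it first
    for i, voice in enumerate(sorted(by_voice)):
        if i < test_voices:
            name = "test"
        elif i < test_voices + dev_voices:
            name = "dev"
        else:
            name = "train"
        for row in by_voice[voice]:
            # Drop the row if its sentence already belongs to another set.
            owner = sentence_owner.setdefault(row["sentence"], name)
            if owner == name:
                splits[name].append(row)
    return splits

rows = [
    {"client_id": "a", "sentence": "s1", "path": "1.mp3"},
    {"client_id": "a", "sentence": "s2", "path": "2.mp3"},
    {"client_id": "b", "sentence": "s1", "path": "3.mp3"},
    {"client_id": "c", "sentence": "s3", "path": "4.mp3"},
    {"client_id": "d", "sentence": "s4", "path": "5.mp3"},
]
splits = split_by_voice(rows)
print({name: [r["path"] for r in part] for name, part in splits.items()})
```

Raising max_per_voice plays the same role as allowing more recordings per person: more data ends up in train, at the cost of more weight on the most active voices.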

It also took me a while to find these when I first started volunteering last year… Also, if you search this forum, you can find related discussions on CorporaCreator decisions (keyword “split”), very old topics, back in 2018 or so…

makoto_wada_jp · September 15, 2022, 11:04pm

Dear @bozden, thank you for your thorough reply.

I didn’t realise there was useful information in the footer, which is in gray text on a black background. Thanks for mentioning it. I also noticed that the GitHub link it jumps to is common-voice / common-voice, which is different from what I mentioned above: common-voice / cv-dataset. I wish FAQ, Discourse, Contact and GitHub in the footer were all in the header … but I understand what you mean below:

I am really impressed with your knowledge of the database. Although what you mention below is mostly captured in the documentation of CorporaCreator, your version below is more concise and better explains its design motivation.

However, I do have 2 questions:

Regarding below, where can I find the metadata that shows the NUMBER OF AUDIO FILES in the clip/ directory?

When you say translations are at 62% below, is this for Japanese or the whole site? Just curious as to where you got the figure. If it is Japanese, then I might contribute since that seems very low.

PS: I should mention this somewhere else, but I did find a broken link in CorporaCreator.

For example, the cleaning for English would be done by the en() method in a file named en.py:

bozden · September 15, 2022, 11:57am

Can this help? Line 904 is for v10.0 Japanese total clips.

https://github.com/common-voice/cv-dataset/blob/f1e9957ddc5ef631b9ca8d99172e3798390090eb/datasets/cv-corpus-10.0-2022-07-04.json#L904


"buckets": {
  "dev": 4312,
  "invalidated": 2262,
  "other": 370,
  "reported": 153,
  "test": 4489,
  "train": 6352,
  "validated": 36021
},
"reportedSentences": 153,
"clips": 38653,
"splits": {
  "accent": { "": 1 },
  "age": {
    "twenties": 0.32,
    "": 0.23,
    "teens": 0.04,
    "fifties": 0.01,
    "thirties": 0.1,
    "fourties": 0.29,
    "sixties": 0,

62% is for Japanese. How to contribute is described under the items on the “About” page we are talking about. It seems last year’s sentence additions have not been translated.

Please check this: https://github.com/common-voice/CorporaCreator/tree/master/src/corporacreator/preprocessors

There are no English preprocessors, therefore the wording uses “would be”… Putting in a non-existent link is not a good way though…

bozden · September 15, 2022, 12:10pm

Overall I’m with you, there is no simple diagram showing how the whole system works. Some processes are triggered manually during the release process, for example, CorporaCreator.

But the place for it would be GitHub, not the CV frontend… AI/ML by itself is a tech/math/science-heavy thing and the frontend is for common people. UX comes into play in these cases, whether we like it or not

makoto_wada_jp · September 16, 2022, 12:06am

Thank you for your reply. I am all set. Just to reiterate, your answers in my own words.

１．The meta information (e.g. NUMBER OF AUDIO FILES in the clip/ directory, non-header entry count in *.tsv, …) about each language can be found here:

GitHub: common-voice / cv-dataset / datasets

cv-corpus-10.0-2022-07-04.json

cv-corpus-9.0-2022-04-27.json

…

２．You can see website localization completion percentages through Mozilla’s Pontoon platform. I saw it in action here:

Common Voice > About > How does site localization work? > WATCH OUR VIDEO EXPLAINER TO HELP (1 min 30 sec)
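On point １, the Japanese figures bozden quoted from cv-corpus-10.0-2022-07-04.json can be checked against the clip/ relationship proposed at the top of the thread. This is a minimal sketch: the JSON below holds only the bucket and clip numbers copied from that excerpt, not the full file, and how the real file nests this object per language is not shown here.

```python
import json

# Values copied from the Japanese excerpt quoted earlier in the thread.
excerpt = json.loads("""
{
  "buckets": {
    "dev": 4312,
    "invalidated": 2262,
    "other": 370,
    "reported": 153,
    "test": 4489,
    "train": 6352,
    "validated": 36021
  },
  "clips": 38653
}
""")

buckets = excerpt["buckets"]

# Hypothesis: clip/ = validated + invalidated + other.
categorised = buckets["validated"] + buckets["invalidated"] + buckets["other"]
print("categories cover all clips:", categorised == excerpt["clips"])

# The default splits use only part of the validated bucket.
in_splits = buckets["train"] + buckets["dev"] + buckets["test"]
print("in train/dev/test:", in_splits, "of", buckets["validated"], "validated")
```

For this excerpt the three categories add up to 38,653, the same as the clips value, which supports the idea that validated, invalidated and other are mutually exclusive and together cover clip/.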

PS: You are right, I overlooked the words “would be”. Yes, I was misled by the link (taking it to mean the file exists), but I guess the link is just to point out where the file should be placed if it were to be implemented. Thanks.

bozden · September 15, 2022, 11:59pm

Glad to be of help

bozden · September 16, 2022, 12:26am

BTW, it is not technical, but you may also like to read the Community Playbook:

HelloTheWorld · September 22, 2022, 8:15pm

Well then, I think it would not be entirely pointless to open a PR for this…

HelloTheWorld · September 23, 2022, 12:38pm

Oh shoot, if only someone were, well, brave enough (?) to try to draw an overview, could it be interesting? … Indeed, I had been thinking about it a lot, and your reply gave me the kick to do it

I’ll start a new thread for this discussion. Goto…