Inadequate Documentation

I find the Common Voice documentation inadequate (incomplete).

  1. I wish the Common Voice > ABOUT had a link to:
  2. I also wish each dataset (e.g. English) had the following information:
    1. NUMBER OF AUDIO FILES (in clip/)
  3. If NUMBER OF VOICES means number of speakers, then NUMBER OF VOICES should be translated to 話者数 (NUMBER OF SPEAKERS) and not 音声数 (NUMBER OF AUDIO) in the Japanese site:
  4. Presuming Common Voice Dataset (GitHub) is maintained by Mozilla
    1. cv-dataset/CHANGELOG.md is out of date. It doesn’t reflect Common Voice Corpus 10.0.
    2. Regarding the Fields section
      1. It is missing a description of the locale field. I also feel the field name should be language, not locale, since it uses values such as ja and yue taken from the ISO 639 language code standards (ja is an ISO 639-1 code; yue is an ISO 639-3 code)
      2. It doesn’t describe the fields (sentence_id and reason) that are only present in reported.tsv.
        (not that it is that important … reported.tsv doesn’t have an empty line at the end like other *.tsv files)
      3. There is a misspelled and inconsistently formatted entry in the demographic spec: southatlandtic should be south_atlantic (adhering to the snake case convention of the other entries there)
    3. What is the relationship between the .tsv files? validated.tsv (Count = 1,589,008) != train.tsv + dev.tsv + test.tsv (Count = 954,094). I believe the relationship between clip/ and the *.tsv files should be documented, like the following, which seems to be true:
      • audio in clip/ folder seems to be validated.tsv + invalidated.tsv + other.tsv
      • validated.tsv & invalidated.tsv & other.tsv are mutually exclusive
File              Count ※
dev.tsv              16,345
invalidated.tsv     248,337
other.tsv           293,021
reported.tsv          4,169
test.tsv             16,345
train.tsv           921,404
validated.tsv     1,589,008

※ doesn’t include the header line in the count
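The arithmetic behind the question can be checked with a minimal Python sketch. The numbers are copied from the table above; `count_rows` is a hypothetical helper (not part of any Common Voice tool) that counts non-header rows in a local `.tsv` file, assuming tab-separated files with one header line.

```python
import csv

# Counts copied from the table above (header line excluded).
COUNTS = {
    "dev.tsv": 16_345,
    "invalidated.tsv": 248_337,
    "other.tsv": 293_021,
    "reported.tsv": 4_169,
    "test.tsv": 16_345,
    "train.tsv": 921_404,
    "validated.tsv": 1_589_008,
}

split_total = COUNTS["train.tsv"] + COUNTS["dev.tsv"] + COUNTS["test.tsv"]
clip_total = (
    COUNTS["validated.tsv"] + COUNTS["invalidated.tsv"] + COUNTS["other.tsv"]
)

print(f"train + dev + test              = {split_total:,}")  # 954,094
print(f"validated                       = {COUNTS['validated.tsv']:,}")
print(f"validated + invalidated + other = {clip_total:,}")   # 2,130,366


def count_rows(tsv_path):
    """Count data rows in a TSV file, excluding the header line.

    Hypothetical helper: QUOTE_NONE keeps quote characters inside
    sentences from being interpreted as CSV quoting.
    """
    with open(tsv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        next(reader, None)  # skip the header line
        return sum(1 for row in reader if row)  # ignore blank lines
```

If the first bullet above holds, the clip/ directory should contain 2,130,366 audio files, which is the number that could be compared against a directory listing.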

Hi @makoto_wada_jp, I hope I can give some pointers.

The GitHub link exists in the footer of every page… But why not? There could be an entry “How can I contribute to code?”… But I think that page is mainly directed at newcomers who are thinking of contributing their voices; that might be the reason.

Yes, those details are in the metadata repo…

The pages are translated by volunteers, you may like to join Pontoon Common Voice Japanese and suggest the correction. Also, the translations are at 62%, your community will welcome your contribution :slight_smile:

Perhaps open an issue on the repo or make a PR for these?

Some info about the clips & splits:

  • Under the clips directory, there are ALL clips recorded, even rejected ones.
  • All clips are divided into three categories: validated, invalidated and other. Other includes those not yet validated or invalidated.
  • The train/dev/test splits are generated from validated with the code in the CorporaCreator repo. It does NOT take the whole dataset and divide it randomly into e.g. 80-10-10%, as you might have seen in other projects. There are some rules in effect that limit the training set:
    • The sample sizes are calculated via a statistical confidence algorithm
    • No sentence can be in more than one set (to prevent sentence bias)
    • No voice (identified by client_id) can be in more than one set (to prevent testing with the same voice you trained on)
    • By default, one recording from one voice is included (to prevent voice bias).

This decision was made at the start of the project to produce the default splits, to ensure diversity and scientific correctness, so that the trained model will perform better in the real world. If you don’t use such restrictions, you may get better results during tests, but the model will not perform as well in real applications.
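The two "no overlap" rules listed above can be checked mechanically. The following is only a sketch of such a check, not code from CorporaCreator: `shared_values` is a hypothetical helper, and the toy rows stand in for rows read from train.tsv, dev.tsv and test.tsv (assuming each row has `client_id` and `sentence` fields).

```python
from itertools import combinations


def shared_values(splits, field):
    """Return {(split_a, split_b): values of `field` present in both}."""
    values = {
        name: {row[field] for row in rows} for name, rows in splits.items()
    }
    return {
        (a, b): values[a] & values[b]
        for a, b in combinations(sorted(values), 2)
        if values[a] & values[b]
    }


# Toy rows standing in for train.tsv / dev.tsv / test.tsv (made-up data).
splits = {
    "train": [{"client_id": "spk1", "sentence": "Hello world."}],
    "dev": [{"client_id": "spk2", "sentence": "Good morning."}],
    "test": [{"client_id": "spk3", "sentence": "Good night."}],
}

for field in ("client_id", "sentence"):
    leaks = shared_values(splits, field)
    status = "OK" if not leaks else f"shared between {sorted(leaks)}"
    print(f"{field}: {status}")
```

With real data, each list could be loaded with `csv.DictReader(f, delimiter="\t")`; an empty result for both fields would mean no voice and no sentence appears in more than one set.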

But one can use CorporaCreator with the “-s N” option to include more recordings per person and get more into the training set. Also, you have the validated.tsv file, so you can implement your own splitting algorithm if you want… Actually, we do so to get better results in low-resourced languages, on paper at least :slight_smile:
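As an illustration of what "your own splitting algorithm" might look like, here is a minimal sketch. It is not CorporaCreator's algorithm (it does not compute sample sizes statistically); it only keeps speakers and sentences disjoint across sets and caps recordings per speaker. The function name, the fractions and the `max_per_speaker` parameter are all assumptions made for this example.

```python
import random
from collections import defaultdict


def speaker_disjoint_split(rows, max_per_speaker=1, dev_frac=0.1,
                           test_frac=0.1, seed=0):
    """Split rows so that no client_id and no sentence spans two sets."""
    by_speaker = defaultdict(list)
    for row in rows:
        by_speaker[row["client_id"]].append(row)

    # Shuffle speakers (not clips), so each voice lands in exactly one set.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    n_test = int(len(speakers) * test_frac)
    n_dev = int(len(speakers) * dev_frac)
    groups = {
        "test": speakers[:n_test],
        "dev": speakers[n_test:n_test + n_dev],
        "train": speakers[n_test + n_dev:],
    }

    sentence_owner = {}  # sentence -> the set that first claimed it
    splits = {name: [] for name in groups}
    for name in ("test", "dev", "train"):
        for speaker in groups[name]:
            kept = 0
            for row in by_speaker[speaker]:
                if kept >= max_per_speaker:
                    break
                # Skip sentences already claimed by a different set.
                if sentence_owner.setdefault(row["sentence"], name) != name:
                    continue
                splits[name].append(row)
                kept += 1
    return splits
```

Raising `max_per_speaker` plays a role similar in spirit to the “-s N” option mentioned above: more recordings per voice reach the training set, at the cost of more voice bias.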

It also took me a while to find these when I first started volunteering last year… Also, if you search this forum, you can find related discussions on CorporaCreator decisions (keyword “split”), very old topics, back in 2018 or so…


Dear @bozden, thank you for your thorough reply.

I didn’t realise there was useful information in the footer, which is in gray text on a black background :rofl:. Thanks for mentioning it. I also noticed that the GitHub link it jumps to is common-voice / common-voice, which is different from what I mentioned above: common-voice / cv-dataset :+1:. I wish FAQ, Discourse, Contact and GitHub in the footer were all in the header … but I understand what you mean below:

I am really impressed with your knowledge of the database. Although what you mention below is mostly captured in the documentation of CorporaCreator, your version below is more concise and better explains its design motivation :grinning:.

However, I do have 2 questions:

  1. Regarding below, where can I find the metadata that shows the NUMBER OF AUDIO FILES in the clip/ directory?
  2. When you say translations are at 62% below, is this for Japanese or the whole site? Just curious as to where you got the figure. If it is Japanese, then I might contribute since that seems very low.

PS: I should mention this somewhere else, but I did find a broken link in CorporaCreator.

For example, the cleaning for English would be done by the en() method in a file named en.py:

Can this help? Line 904 is for v10.0 Japanese total clips.

62% is for Japanese. How to contribute is described under the items on the “about” page we are talking about. It seems last year’s sentence additions are not translated.

Please check this: https://github.com/common-voice/CorporaCreator/tree/master/src/corporacreator/preprocessors

There are no English preprocessors, therefore the wording uses “would be”… Putting in a non-existent link is not a good approach though…


Overall I’m with you, there is no simple diagram showing how the whole system works. Some processes are triggered manually during the release process, for example, CorporaCreator.

But the place for it would be GitHub, not the CV frontend… AI/ML by itself is a tech/math/science-heavy thing and the frontend is for the general public. UX comes into play in these cases, whether we like it or not :slight_smile:


Thank you for your reply. I am all set. Just to reiterate your answers in my own words:

1. The meta information (e.g. NUMBER OF AUDIO FILES in the clip/ directory, non-header entry count in *.tsv, …) about each language can be found here:

GitHub: common-voice / cv-dataset / datasets

2. You can see website localization completion percentages through Mozilla’s Pontoon platform. I saw it in action through here:


PS: You are right, I overlooked the words “would be”. Yes, I was misled by the link (taking it to mean the file exists), but I guess the link just points out where the file should be placed if it were to be implemented. Thanks.

Glad to be of help :slight_smile:


BTW, it is not technical, but you may also like to read the Community Playbook:


Well, then, I think it would not be completely without interest to open a PR for this…


Oh shoot, if only someone were, well, brave enough (?) to try to draw an overview, could it be interesting? … Indeed, I had been thinking a lot about it, and your reply gave me the kick to do it :slight_smile:

I’ll start a new thread for this discussion. Goto…