While I’m here, let me add some more thoughts…
First of all, Common Voice has come a long way since 2018, when I first registered. The current roadmap and weekly posts point to a bright future. But there is much work to do.
Community
Here, I think the word “community” is understood in its broader sense (all people donating to a language)… From my perspective (civil society terminology), a community is a group of people interacting with each other / working together by being in one place. There are 900+ different voices and 30h of validated recordings in Turkish, but the CV-Turkish community size is now 1.
People come…
- They might or might not read the webpages (FAQ, how-to pages, Playbook)
- They might or might not register
- They most probably do not visit Discourse, let alone Matrix etc…
- A very large portion of them will never know about machine learning, nor is that expected.
- Some of them do not even excel in their mother tongue.
They come in and record, some also listen, and then they leave. In their free time…
Community Needs
They have no real guidance. They wouldn’t know the ML requirements or how to solve language-specific cases/problems. Nowhere is it specified how much they should record and when to stop. They have only very basic guidance and simple examples, if those have been adapted; otherwise they are stuck with the dinosaurs. If they reach Discourse, it is in English, including the examples and the excellent Playbook (well, translation is on my to-do list). If there is a sub-Discourse for their language, it is not easy to find either. Google and the search function find them, but they are not visible inside Discourse.
A language community, from my perspective, is an e-NGO, or rather a regional office of an international one, and should act as one. In that NGO there must be designers, marketing specialists, data analysts, ML experts etc. Dataset problems and/or dataset expansion are language-specific and should be monitored by a group of more knowledgeable people. So, like @ftyers suggested in the first reply, a “language board” might suggest a specific direction for the language, such as disabling recording for some people in that language (the “CV board” must be in the loop, of course). CV could provide the necessary SW changes to facilitate this. So, with a couple of interactive settings, the behavior in that language could change (it would not be wise to implement such measures globally, IMHO).
But for any of this to work:
- Languages should have sub-forums, keep language-specific info there, and communicate there. The alternative is to form one outside…
- New & old users should be directed to that forum (if it is outside, this is not advisable)
- New users should get an e-mail after registration directing them to that sub-forum.
- People donating to a language should be reachable via secure e-communication.
- …
Currently, for Turkish, we could do the first item. The others can be done in time, but the last one is not possible AFAIK…
Status
For me and Turkish, status is as follows:
- In March 2021, when I resumed working on this topic, I made 800+ recordings but stopped for the sake of diversity. Some did not and tripled their recordings.
- I was away for 3 months. When I returned, about 6700 recordings (calculated from corpus data) were waiting to be checked. I did 2k+, but what about the second check?
- 11k+ sentences were added by me alone. One person added more, but those had to be deleted (bad source & malformed divisions). This is the toughest area and needs more people.
- The female voice share is 6%, and I hear only one now. The rest are all young males. And there is an unknown 24%. With current privacy measures one can listen to them, or just reject them.
- From Corpus v6.1, I listened to many recordings where accept=2 & reject=1, and many of them should have been rejected. People just ignore small mistakes; they don’t know!
- There was one person who recorded many clips (guess: 1000) with a heavy accent. Good to have that, but not that much. Then another one came along, someone who shouts and/or changes his voice in many funny ways. I had to reject many of those recordings.
- …
So, this is not a good, healthy dataset…
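A few of the numbers above (gender share, clips per voice, clips with accept=2 & reject=1) can be pulled out of a corpus clip file with a short script. What follows is only a minimal sketch: the column names (`client_id`, `path`, `up_votes`, `down_votes`, `gender`) are assumed from the released TSV files, the rows are invented, and the share is counted per clip rather than per voice, which is my assumption.

```python
import csv
import io
from collections import Counter

# Invented sample rows in the assumed shape of a corpus clip TSV.
SAMPLE_TSV = (
    "client_id\tpath\tup_votes\tdown_votes\tgender\n"
    "a\tc1.mp3\t2\t1\tmale\n"
    "a\tc2.mp3\t2\t0\tmale\n"
    "a\tc3.mp3\t2\t1\tmale\n"
    "b\tc4.mp3\t2\t0\tfemale\n"
    "c\tc5.mp3\t0\t2\t\n"
)


def corpus_health(tsv_text):
    """Return (gender share in %, clips per voice, borderline clip paths)."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    rows = list(reader)
    total = len(rows)

    # Share of clips per gender label; an empty label is counted as "unknown".
    gender_counts = Counter(r["gender"] or "unknown" for r in rows)
    gender_share = {g: round(100 * n / total, 1) for g, n in gender_counts.items()}

    # Number of clips per anonymous voice, to spot over-represented speakers.
    clips_per_voice = Counter(r["client_id"] for r in rows)

    # Clips that were accepted with 2 up votes despite 1 down vote.
    borderline = [
        r["path"] for r in rows
        if r["up_votes"] == "2" and r["down_votes"] == "1"
    ]
    return gender_share, clips_per_voice, borderline


gender_share, clips_per_voice, borderline = corpus_health(SAMPLE_TSV)
print(gender_share)
print(clips_per_voice.most_common(3))
print(borderline)
```

For a real file, the text of e.g. a `validated.tsv` would be read from disk and passed in instead of `SAMPLE_TSV`. The borderline list is only a starting point for re-listening; it does not say which of those clips are actually wrong.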
Possible measures
In a perfect world, what measures would be needed (all over time and non-identifiable) for health analysis AND performance? Some could be:
- Real-time, daily or weekly data on all measures (whichever are possible with the current infrastructure; at present we can only get them from the Corpus summary and/or data)
- Total/new subscribers for a language, their gender/age distributions
- Total/new recording / listening / sentence addition & verification numbers, their cross tables / graphics / frequency distributions / ratios to answer questions such as: “How much listening do people with >100 recordings do?”
- Who are the people with the most recordings rejected? How many total rejects?
- What is the number of active volunteers? How many did we lose / how many never visit again? How much time do they spend?
- How many of them are also active in Discourse / UI translation etc…
- …
There might be many more from volunteer management area…
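To make the cross-table idea concrete, here is a rough sketch of how a question like “how much listening do people with many recordings do” could be answered. It assumes a pseudonymised activity log of (volunteer, action) events, which as far as I know is not available today; the event list, the function name and the threshold are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical, non-identifiable activity events: (volunteer pseudonym, action).
EVENTS = (
    [("v1", "record")] * 3
    + [("v1", "listen")] * 1
    + [("v2", "record")] * 1
    + [("v2", "listen")] * 4
    + [("v3", "record")] * 1
)


def listening_by_recording_group(events, threshold):
    """Average number of listens for volunteers above / not above the threshold."""
    per_volunteer = defaultdict(Counter)
    for volunteer, action in events:
        per_volunteer[volunteer][action] += 1

    groups = {"heavy": [], "light": []}
    for counts in per_volunteer.values():
        key = "heavy" if counts["record"] > threshold else "light"
        # A Counter returns 0 for volunteers who never listened.
        groups[key].append(counts["listen"])

    return {
        key: (sum(values) / len(values) if values else 0.0)
        for key, values in groups.items()
    }


# A threshold of 2 suits the toy data; the question in the list would use 100.
print(listening_by_recording_group(EVENTS, threshold=2))
```

The same per-volunteer counters could feed the other questions in the list, for example reject totals, if reject events were part of such a log.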
Edit (21.09.2021) - Additions to possible measures:
- Number of downloads from each Corpus version
- Distributions like recordings/visit, listening/visit, sentence addition/visit, sentence verification/visit and how much time they spend…
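The per-visit distributions could be summarised along these lines. Again just a sketch: the visit records and their field names (`recordings`, `listens`, `minutes`) are made up for illustration, and no such per-visit data is published as far as I know.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical per-visit log: one record per (anonymous) visit.
VISITS = [
    {"recordings": 5, "listens": 0, "minutes": 4},
    {"recordings": 5, "listens": 10, "minutes": 12},
    {"recordings": 0, "listens": 20, "minutes": 9},
    {"recordings": 50, "listens": 0, "minutes": 35},
]


def per_visit_summary(visits, field):
    """Mean, median and frequency distribution of one field across visits."""
    values = [visit[field] for visit in visits]
    return {
        "mean": mean(values),
        "median": median(values),
        "distribution": Counter(values),
    }


for field in ("recordings", "listens", "minutes"):
    print(field, per_visit_summary(VISITS, field))
```

Mean and median are shown together on purpose: a single very long session pulls the mean up, while the median stays close to what a typical visit looks like.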