While I’m here, let me add some more thoughts…
First of all, Common Voice has come a long way since 2018, when I first registered. The current roadmap and weekly posts show a bright future, but there is much work to do.
Community
Here, the word “community” is understood in its broader sense, I think (all people donating to a language)… From my perspective (civil society terminology), a community is a group of people interacting with each other and working together by being in one place. There are 900+ different voices and 30h of validated recordings in Turkish, but the CV-Turkish community size is now 1.
People come…
- They might or might not read the webpages (FAQ, how-to pages, Playbook)
- They might or might not register
- They most probably do not enter Discourse, and never Matrix etc…
- A very large portion of them will never know about machine learning; that is not expected of them.
- Some of them do not even excel in their mother tongue.
They come in and record, some also listen, and they exit. In their free time…
Community Needs
They have no guidance. They wouldn’t know ML requirements or how to solve language-specific cases/problems. Nowhere is it specified how much they should record and when to stop. They have only very basic guidance and simple examples, if those have been adapted; otherwise they are stuck with the dinosaurs. If they reach Discourse, it is in English, including the examples and the excellent Playbook (well, translation is on my to-do list). If there is a sub-Discourse for their language, it is not easy to find either. Google and search work, but the sub-forums are not visible inside Discourse.
A language community, from my perspective, is an e-NGO, or rather a regional office of an international one, and should act as one. In that NGO there must be designers, marketing specialists, data analysts, ML experts etc. Dataset problems and/or dataset expansion are language-specific and should be monitored by a group of more knowledgeable people. So, like @ftyers suggested in the first reply, a “language board” might suggest a specific direction for the language, such as disabling recording for some people in that language; the “CV board” must be in the loop, of course. CV could provide the necessary software changes to facilitate this. So, with a couple of interactive settings, the behavior in that language could change (it would not be wise to implement such measures globally, IMHO).
But for any of this to work:
- Languages should have sub-forums, put language-specific info there, and communicate there. The alternative is to form one outside…
- New & old users should be directed to that forum (if outside, this is not advisable)
- New users should get an e-mail after registration directing them to that sub-forum.
- People donating to a language should be accessible via secure e-communication.
- …
Currently, for Turkish, we could do the first item. The others can be done in time, but the last one is not possible AFAIK…
Status
For me and Turkish, status is as follows:
- In March 2021, when I restarted work on this topic, I did 800+ recordings but stopped for the sake of diversity. Some did not and tripled their recordings.
- I was away for 3 months; when I returned, about 6700 recordings (calculated from corpus data) were waiting to be checked. I did 2k+, but what about the second check?
- 11k+ sentences were added by me alone; one person added more, but those had to be deleted (bad source & malformed divisions). This is the toughest area and needs more people.
- Female voice percentage is 6%; I hear only one now. All are young males. And there is an unknown 24%. With current privacy measures one can listen to them, or just reject them.
- From Corpus v6.1, I listened to many recordings where accept=2 & reject=1, and many of them should have been rejected. People just ignore small mistakes; they don’t know!
- There was one person who recorded many clips (guess: 1000) with a heavy accent; good to have that, but not that much. But then another one came along, someone who shouts and/or changes his voice in many funny ways. I had to reject many of those recordings.
- …
So, this is not a good, healthy dataset…
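Some of the numbers above can be recomputed by anyone from the downloaded corpus metadata. Below is a minimal sketch, not an official tool. It assumes a tab-separated clip listing with the columns `client_id`, `path`, `up_votes`, `down_votes` and `gender`; the function name `summarize_clips` and the tiny `SAMPLE` data are made up for illustration, so please check the column names against the release you actually download.

```python
import csv
import io
from collections import Counter

# Made-up sample rows; the column names are an assumption about the TSV layout.
SAMPLE = (
    "client_id\tpath\tsentence\tup_votes\tdown_votes\tage\tgender\n"
    "a1\tclip_1.mp3\tMerhaba\t2\t0\ttwenties\tmale\n"
    "a1\tclip_2.mp3\tGünaydın\t2\t1\ttwenties\tmale\n"
    "b2\tclip_3.mp3\tTeşekkürler\t2\t1\t\t\n"
    "c3\tclip_4.mp3\tEvet\t3\t0\tthirties\tfemale\n"
)


def summarize_clips(tsv_text):
    """Return gender shares (%), clips per speaker, and borderline clip paths."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    rows = list(reader)
    total = len(rows)

    # An empty gender field means the speaker did not declare one.
    genders = Counter((row["gender"] or "unknown") for row in rows)
    gender_share = {g: round(100 * n / total, 1) for g, n in genders.items()}

    # How many clips each (anonymous) speaker contributed.
    clips_per_speaker = Counter(row["client_id"] for row in rows)

    # Clips with accept=2 & reject=1, candidates for a manual re-check.
    borderline = [
        row["path"]
        for row in rows
        if int(row["up_votes"]) == 2 and int(row["down_votes"]) == 1
    ]
    return gender_share, clips_per_speaker, borderline


if __name__ == "__main__":
    share, per_speaker, borderline = summarize_clips(SAMPLE)
    print(share)
    print(per_speaker.most_common(3))
    print(borderline)
```

To use it on real data, one would read the TSV file from disk instead of `SAMPLE`. The speaker IDs stay anonymous, so this fits the “non-identifiable” requirement.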
Possible measures
In a perfect world, what measures would be needed (all with time scale and non-identifiable) for health analysis AND performance? Some could be:
- Real-time, daily or weekly data on all measures (whichever is possible with the current infrastructure; at present we can only get it from the Corpus summary and/or data)
- Total/new subscribers for a language, their gender/age distributions
- Total/new recording / listening / sentence addition & verification numbers, their cross tables / graphics / frequency distributions / ratios to answer questions such as: “How much listening do people with >100 recordings do?”
- Who are the people with the most recordings rejected? How many total rejects?
- What is the number of active volunteers? How many did we lose / never visit? How much time do they spend?
- How many of them are also active in Discourse / UI translation etc…
- …
There might be many more from the volunteer management area…
Edit (21.09.2021) - Additions to possible measures:
- Number of downloads from each Corpus version
- Distributions like recordings/visit, listening/visit, sentence addition/visit, sentence verification/visit and how much time they spend…
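To make a couple of these measures concrete, here is a rough sketch of a frequency distribution of recordings per contributor and a “most rejected” list. Everything in it is an assumption for illustration: the bucket limits, the function names, and the idea that we have one anonymous ID per recording and per rejected recording. Per-visit and listening figures are not covered, since I do not assume that data is available.

```python
from collections import Counter


def bucket(count):
    """Map a per-contributor recording count to a coarse bucket (limits are arbitrary)."""
    if count <= 10:
        return "1-10"
    if count <= 100:
        return "11-100"
    if count <= 1000:
        return "101-1000"
    return "1000+"


def contribution_distribution(client_ids):
    """Count how many contributors fall into each bucket.

    `client_ids` holds one anonymous ID per recording.
    """
    per_contributor = Counter(client_ids)
    return Counter(bucket(n) for n in per_contributor.values())


def most_rejected(rejected_client_ids, top=3):
    """Return the `top` anonymous IDs with the most rejected recordings."""
    return Counter(rejected_client_ids).most_common(top)


if __name__ == "__main__":
    # Made-up data: four contributors with very different activity levels.
    ids = ["a"] * 150 + ["b"] * 5 + ["c"] * 40 + ["d"] * 2
    print(contribution_distribution(ids))
    print(most_rejected(["a", "a", "a", "b", "c", "c"], top=2))
```

A distribution like this would show at a glance whether a few voices dominate the dataset, without identifying anyone.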