Limit recording and validating per person (225 record, 450 validate).
I don’t see the point of putting limits on validation. Especially since this is clearly an area that will always lack contributors.
I think it’s important, in order to eliminate/minimize bias.
I have been sitting with people in recording and validation sessions.
I have noticed that in some instances there are disagreements between them on what is valid and what is not.
People hear differently.
Sorry for replying late; I was on a long vacation and am merely trying to catch up. Anyway, I’d like to share what’s on my mind. This also includes some UX suggestions.
I must say that I agree with @ftyers, @daniel.abzakh and the matrix post on all points, perhaps except one: limiting recording (if it means temporarily disabling). I said otherwise on the other thread, but after some thought I changed my mind; in the end, quality recordings are what count. You may throttle new recordings from a person who is recording too much (e.g. more than 100 sentences/day) and promote others, as I try to explain below.
If we consider this together with the Recognition, Rewards and Contribution Pathways post, I would say “promote what is missing” on the dashboard, such as:
A casual user only gets to the dashboard. There are no visible pointers to the (sub-)Discourse, and even the Sentence Collector is buried somewhere deep (and is not multilingual); I could only find it after reading some Discourse posts (at the very beginning of my journey). When you click the Contribute link on the dashboard, it shows you the “speak” tab first; why not switch it to “listen”?
As mentioned in the Recognition post above, the parameters in recognition/reward calculations can be adapted to what is needed. If each action is worth 1 point at the start, you may drop recording points to 0.5 and increase listening points to 1.5 (calculated from the queue length, say) if the listening queue keeps growing. You may even give extra promotion points (a multiplier, perhaps) if a language achieves an increase of xx% between Corpus versions. Communities might decide to run a limited-time campaign and give extra points to the missing part. You could even promote/demote people on recording quality: every bad (rejected) recording drops some points, continuous good recordings add more… See the sketch right below.
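To make the idea concrete, here is a minimal sketch of such queue-driven weighting. Everything in it is invented for illustration (the function name, the 100,000-clip scale, the 100-clips/day throttle); it is not actual Common Voice code.

```python
# Hypothetical adaptive point weights driven by the validation queue.
def action_points(action: str, queue_len: int, recorded_today: int) -> float:
    """Return the points for one contribution, nudging contributors
    toward whatever the dataset currently lacks."""
    base = 1.0
    if action == "listen":
        # The longer the unvalidated queue, the more listening is worth.
        return base + min(queue_len / 100_000, 0.5)  # capped at 1.5
    if action == "record":
        # Devalue recording while the queue is long...
        weight = base - min(queue_len / 100_000, 0.5)  # floored at 0.5
        # ...and soft-throttle users who already recorded a lot today.
        if recorded_today > 100:
            weight *= 0.1
        return weight
    return base
```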
Very broad range of opportunities here. Marketing people would suggest many more…
If you give coins/points etc., you should probably remove the current leaderboards and show that information in people’s profiles. Perhaps showing “x recordings rejected”-type info would mean a lot here. You might like to put a “contributors with top points” list back instead… People have a tendency to strive for the top, and perhaps such a calculation would mean more for the overall quality.
But please do not remove the points gained from the four main tasks. Not everyone can open a stand at a fair, organize a campaign with printed materials, or has good communication and/or organizational skills. The project needs everyone, and time is the most precious thing one has, and they are giving it. You may lose some people. The emphasis should be on creating a good thing and doing it together.
In addition: Common Voice should provide some e-tools (like mass-mailing the contributors of language X) for promoting languages/communities; cold-starting and expanding a community is tough. Adding the following to the dashboard should be easy enough:
The designs for promotional material are a good start though, thank you for those.
@daniel.abzakh, I’m no expert in ML; I only know a bit from the courses I took. But using some common sense and expertise from my long life in civil society, I can say the following. Please correct me if I’m wrong; most of these are probably points you are already aware of, sorry for that…
First of all, please understand that I’m with you on limiting the recordings of those users if absolutely necessary (as in “throttling”, but not as in “disabling”), if only a couple out of a thousand volunteers are doing most of the recordings.
Their voices will be heard indefinitely, because this is a time capsule. Some may want their voices to be heard; some may want their dialect/tongue to live on. Probably this is why they are so persistent… Perhaps CV can prepare a questionnaire and ask everybody the following question: “Why are you volunteering?”…
BTW, on another thread I suggested throttling people when their recordings reach X times the average for that language… An even better place for this would be throttling/disabling people whose rejected-recording ratio exceeds a threshold, to prevent spammers… Both rules are sketched below.
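A minimal sketch of both rules, with the multiplier, ratio and sample-size thresholds invented purely for illustration:

```python
# Hypothetical throttle/block rules per user and language.
def should_throttle(user_clips: int, language_avg: float,
                    factor: float = 5.0) -> bool:
    """Throttle once a user reaches `factor` times the language average."""
    return user_clips >= factor * language_avg

def should_block(rejected: int, total: int, max_ratio: float = 0.3,
                 min_sample: int = 50) -> bool:
    """Disable recording when the rejected ratio exceeds a threshold,
    ignoring users with too few reviewed clips to judge fairly."""
    return total >= min_sample and rejected / total > max_ratio
```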
Edit: Typos and leftovers.
If the science says you should do it this way, then probably you should.
This is an indication of a problem, find out why and how to fix it.
There are many open source projects that need volunteers and help, we could steer their attention to those and utilize their energy more effectively where it’s really needed.
This is not a sufficient excuse to allow mass recordings from one person.
Polluting the dataset adds additional overhead for researchers.
You are suggesting their work was a waste of time in the first place.
You could record e-books and open-source them; this project has a different purpose.
In addition, negative impacts of allowing unlimited recordings per person:
Thank you for your comments! I’ll only answer one of them, as it seems to have been misunderstood.
I’m saying that any negative effects they are causing will be diminished.
While I’m here, let me add some more thoughts…
First of all, Common Voice has come a long way since 2018, when I first registered. The current roadmap and weekly posts point to a bright future. But there is much work to do.
Here, the word “community” is understood in its broader sense, I think (all the people donating to a language)… From my perspective (civil society terminology), a community is a group of people interacting with each other / working together in a shared place. There are 900+ different voices and 30 h of validated recordings in Turkish, but the CV-Turkish community size is now 1.
People come…
They come in and record, some also listen, and they leave. In their free time…
They have no guidance. They wouldn’t know the ML requirements or how to solve language-specific cases/problems. There is no specification anywhere on how much they should record and when to stop. They have only very basic guidance and simple examples, if those have been adapted; otherwise they are stuck with the dinosaurs. If they reach Discourse, it is in English, even the examples and the excellent Playbook (well, its translation is on my to-do list). If there is a sub-Discourse for their language, it is not easy to find either. Google search works, but these are not visible from inside Discourse.
A language community, from my perspective, is an e-NGO, or rather a regional office of an international one, and should act as one. In that NGO there must be designers, marketing specialists, data analysts, ML experts etc. Dataset problems and/or the dataset’s expansion are language-specific and should be monitored by a group of more knowledgeable people. So, as @ftyers suggested in the first reply, a “language board” might suggest a specific direction for the language, such as disabling recording for some people in that language; the “CV board” must be in the loop, of course. CV could provide the necessary software changes to facilitate this. So, with a couple of interactive settings (sketched below), the behavior in that language could change. (It would not be wise to implement such measures globally, IMHO.)
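As a rough illustration of what such per-language settings might look like; all field names and defaults here are hypothetical, not an existing CV feature:

```python
# Hypothetical per-language policy a "language board" could tune.
from dataclasses import dataclass
from typing import Optional

@dataclass
class LanguagePolicy:
    daily_record_limit: Optional[int] = None    # None = no daily throttle
    max_avg_multiple: Optional[float] = None    # throttle at N x language average
    max_rejected_ratio: Optional[float] = None  # block probable spammers
    recording_enabled: bool = True              # board can pause recording

# Example: a board throttles heavy recorders but keeps recording open.
POLICIES = {
    "tr": LanguagePolicy(daily_record_limit=100, max_avg_multiple=5.0,
                         max_rejected_ratio=0.3),
}
```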
But for any of this to work:
Currently, for Turkish, we could do the first item. The others can be done in time, but the last one is not possible, AFAIK…
For me and Turkish, the status is as follows:
So, this is not a good, healthy dataset…
In a perfect world, which measures would be needed (all with a time scale and non-identifiable) for health analysis AND performance? Some could be:
There might be many more from the volunteer-management field…
Edit (21.09.2021) - Additions to possible measures:
Agreed on this. If people cannot meet and learn with other contributors (online/offline) within the current site structure and form a community, “community health” barely matters. How can we improve overall community health if we don’t have ways to connect?
Until now, many local Common Voice communities have formed around a few specific core contributors. It’s hard for us to meet and greet new participants, provide help, and answer questions. I only know the contributors around me and am unfamiliar with maybe 99% of the participants in our locale; some of them may have questions or would like to participate more. If there were a way for us to connect with more participants, we would have a chance to onboard them from casual to core.
I keep hoping that we can have something similar to Wikipedia’s links to local community channels (at the top of every Wikipedia article!) to help bring participants closer together.
That’s the reason up to three people are involved in the validation process.
@daniel.abzakh @heyhillary
@bozden
@ftyers
Imho:
But here are some proposals to discuss:
PS: Isn’t limiting the core function of CV a little bit sub-optimal?
As a developer working on a low-resource language, getting enough sentences that are checked and cleaned up, not to mention contacting authors to release their material into the public domain, is a big task. Around 1.8 million sentences are needed for 2,000 hours; if the dataset is diverse enough, that could do it. (A rough back-of-the-envelope check follows.)
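For anyone wondering where the 1.8 million figure comes from, it is consistent with an average clip length of about four seconds, which is my assumption, not something stated in the post:

$$
2000\ \text{h} \times 3600\ \tfrac{\text{s}}{\text{h}} = 7{,}200{,}000\ \text{s},
\qquad
\frac{7{,}200{,}000\ \text{s}}{4\ \text{s/clip}} \approx 1.8\ \text{million clips}.
$$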
You can check out our team’s work in these links 1, 2.
Allowing contributors to record thousands of sentences is counterproductive. It is more productive to guide them to invite others to contribute.
I’m not sure what you mean by that but you can submit large batches of sentences via pull requests, here is an example.
When someone starts a new language section, the limit would be the 5,000 sentences needed to open that section to the public. After these sentences have been recorded (and no new ones added), the contributor gets a message: no further recording is possible. So that would be the actual limit at the moment.
If the error rate is higher than 5% (randomly checked), the batch is declined. @mkohler knows this process better than me. (A rough sketch of such a sampling check is below.)
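A minimal sketch of a random-sample check like the one described; the sample size and the review callback are assumptions on my part, not the actual Sentence Collector implementation:

```python
import random
from typing import Callable, List

def review_batch(sentences: List[str], is_faulty: Callable[[str], bool],
                 sample_size: int = 100) -> bool:
    """Accept the batch only if the estimated error rate is at most 5%."""
    sample = random.sample(sentences, min(sample_size, len(sentences)))
    errors = sum(1 for s in sample if is_faulty(s))
    return errors / len(sample) <= 0.05
```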
So to speak, if you want to keep the train running, a constant flow of new sentences to record is needed, especially when media/social-media campaigns are coming up.
It is always good to get new contributors for CV
I just wondered why someone would want to limit the biggest resource the CV project has (contributors recording clips) for the sake of diversity, slowing down the main goal of getting 10,000+ hours of validated clips.
Surely more data is better. If we had 10,000 hours of data from a single white upper middle class male, and that data can be used to train a model that sort of works for a privacy-focused TTS if you speak deep and sound posh, then that’s better than not having private TTS at all.
Oh come on, no one posted anything about one person recording 10,000 hours. That was clearly not the message before.
Yeah, I was taking an extreme; even if it were that extreme, it would still be preferable to not having enough data. If argumentum ad absurdum can’t even make a case for diversity over data, then it’s folly at best, sabotage at worst.
IMHO, the second important aspect of Common Voice is recording dying languages. There are about 7,000 languages and about 2,500 are in danger. And in the digital era, only 5% are represented on the WWW.
Most of these (very local, some tribal) languages are spoken only by a few elderly people, who themselves are nearing their EOL.
I’d say, like we do in Oral History interviews for museums and language research, let’s forget about diversity and ML in these cases and record them…
Thank you for reminding me…
I do not think that this is the main goal, and thinking that way is very reductionist and problematic. We can think of the main goal (IMO) as getting enough appropriate data for a robust speech-recognition system. That cannot be defined only in the number of hours or the number of clips.