Limit recording and validating per person (225 record, 450 validate).
I don’t see the point of putting limits on validation. Especially since this is clearly an area that will always lack contributors.
I think it’s important, in order to eliminate/minimize bias.
I have been sitting with people in recording and validation sessions.
I have noticed that in some instances there are disagreements between them on what is valid and what is not.
People hear differently.
Sorry for replying late; I was on a long vacation and am merely trying to catch up. Anyway, I’d like to share what’s on my mind. This also includes some UX suggestions.
I must say that I agree with @ftyers, @daniel.abzakh and the matrix post on all points, perhaps except one: limiting recording (if it means temporarily disabling). I said otherwise on the other thread, but after some thought I changed my mind; in the end, quality recordings are what count. You may throttle new recordings from a person who is recording too much (e.g. more than 100 sentences/day) and promote others, as I try to explain below.
If we consider this together with the Recognition, Rewards and Contribution Pathways post, I would say “promote what is missing” on the dashboard, such as:
A casual user only gets to the dashboard. There are no visible pointers to the (sub-)Discourse, and even the Sentence Collector is buried somewhere deep (and is not multilingual); I could only find it after reading some Discourse posts (at the very beginning of my journey). When you click the Contribute link on the dashboard, it shows you the “speak” tab first; why not switch it to “listen”?
As mentioned in the Recognition post above, the parameters in recognition/reward calculations can be adapted to what is needed. If each action is worth 1 point at the start, you may drop recording points to 0.5 and increase listening points to 1.5 (calculated from the queue length, say) if the listening queue keeps growing. You may even give extra promotion points (a multiplier, perhaps) if a language achieves an increase of xx% between Corpus versions. Communities might decide to run a limited-time campaign and give extra points to the missing part. You could even promote/demote people on recording quality: every bad (rejected) recording drops some points, continuous good recordings add more… See the sketch right below.
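To make the idea concrete, here is a minimal sketch of such queue-driven weighting. Everything in it is invented for illustration (the function name, the 100,000-clip scale, the 100-clips/day throttle); it is not actual Common Voice code.

```python
# Hypothetical adaptive point weights driven by the validation queue.
def action_points(action: str, queue_len: int, recorded_today: int) -> float:
    """Return the points for one contribution, nudging contributors
    toward whatever the dataset currently lacks."""
    base = 1.0
    if action == "listen":
        # The longer the unvalidated queue, the more listening is worth.
        return base + min(queue_len / 100_000, 0.5)  # capped at 1.5
    if action == "record":
        # Devalue recording while the queue is long...
        weight = base - min(queue_len / 100_000, 0.5)  # floored at 0.5
        # ...and soft-throttle users who already recorded a lot today.
        if recorded_today > 100:
            weight *= 0.1
        return weight
    return base
```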
Very broad range of opportunities here. Marketing people would suggest many more…
If you give coins/points etc., you should probably remove the current leaderboards and show that information in people’s profiles. Perhaps showing “x recordings rejected”-type info would mean a lot here. You might like to put a “contributors with top points” list back instead… People have a tendency to strive for the top, and perhaps such a calculation would mean more for the overall quality.
But please do not remove the points gained from the four main tasks. Not everyone can open a stand at a fair, organize a campaign with printed materials, or has good communication and/or organizational skills. The project needs everyone, and time is the most precious thing one has, and they are giving it. You may lose some people. The emphasis should be on creating a good thing and doing it together.
In addition: Common Voice should provide some e-tools (like mass-mailing the contributors of language X) for promoting languages/communities; cold-starting and expanding a community is tough. Adding the following to the dashboard should be easy enough:
The designs for promotional material are a good start though, thank you for those.
@daniel.abzakh, I’m no expert in ML; I only know a bit from the courses I took. But using some common sense and expertise from my long life in civil society, I can say the following. Please correct me if I’m wrong; most of these are probably points you are already aware of, sorry for that…
First of all, please understand that I’m with you on limiting the recordings of those users if absolutely necessary (as in “throttling”, but not as in “disabling”), if only a couple out of a thousand volunteers are doing most of the recordings.
Their voices will be heard indefinitely, because this is a time capsule. Some may want their voices to be heard; some may want their dialect/tongue to live on. Probably this is why they are so persistent… Perhaps CV can prepare a questionnaire and ask everybody the following question: “Why are you volunteering?”…
BTW, on another thread I suggested throttling people when their recordings reach X times the average for that language… An even better place for this would be throttling/disabling people whose rejected-recording ratio exceeds a threshold, to prevent spammers… Both rules are sketched below.
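A minimal sketch of both rules, with the multiplier, ratio and sample-size thresholds invented purely for illustration:

```python
# Hypothetical throttle/block rules per user and language.
def should_throttle(user_clips: int, language_avg: float,
                    factor: float = 5.0) -> bool:
    """Throttle once a user reaches `factor` times the language average."""
    return user_clips >= factor * language_avg

def should_block(rejected: int, total: int, max_ratio: float = 0.3,
                 min_sample: int = 50) -> bool:
    """Disable recording when the rejected ratio exceeds a threshold,
    ignoring users with too few reviewed clips to judge fairly."""
    return total >= min_sample and rejected / total > max_ratio
```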
Edit: Typos and leftovers.
If the science says you should do it this way, then probably you should.
This is an indication of a problem, find out why and how to fix it.
There are many open source projects that need volunteers and help, we could steer their attention to those and utilize their energy more effectively where it’s really needed.
This is not a sufficient excuse to allow mass recordings from one person.
Polluting the dataset adds additional overhead for researchers.
You are suggesting their work was a waste of time in the first place.
You could record e-books and open-source them; this project has a different purpose.
In addition, negative impacts of allowing unlimited recordings per person:
Thank you for your comments! I’ll only answer one of them, as it seems to have been misunderstood.
I’m saying that any negative effects they are causing will be diminished.
While I’m here, let me add some more thoughts…
First of all, Common Voice has come a long way since 2018, when I first registered. The current roadmap and weekly posts point to a bright future. But there is much work to do.
Here, the word “community” is understood in its broader sense, I think (all the people donating to a language)… From my perspective (civil society terminology), a community is a group of people interacting with each other / working together in a shared place. There are 900+ different voices and 30 h of validated recordings in Turkish, but the CV-Turkish community size is now 1.
People come…
They come in and record, some also listen, and they leave. In their free time…
They have no guidance. They wouldn’t know the ML requirements or how to solve language-specific cases/problems. There is no specification anywhere on how much they should record and when to stop. They have only very basic guidance and simple examples, if those have been adapted; otherwise they are stuck with the dinosaurs. If they reach Discourse, it is in English, even the examples and the excellent Playbook (well, its translation is on my to-do list). If there is a sub-Discourse for their language, it is not easy to find either. Google search works, but these are not visible from inside Discourse.
A language community, from my perspective, is an e-NGO, or rather a regional office of an international one, and should act as one. In that NGO there must be designers, marketing specialists, data analysts, ML experts etc. Dataset problems and/or the dataset’s expansion are language-specific and should be monitored by a group of more knowledgeable people. So, as @ftyers suggested in the first reply, a “language board” might suggest a specific direction for the language, such as disabling recording for some people in that language; the “CV board” must be in the loop, of course. CV could provide the necessary software changes to facilitate this. So, with a couple of interactive settings (sketched below), the behavior in that language could change. (It would not be wise to implement such measures globally, IMHO.)
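As a rough illustration of what such per-language settings might look like; all field names and defaults here are hypothetical, not an existing CV feature:

```python
# Hypothetical per-language policy a "language board" could tune.
from dataclasses import dataclass
from typing import Optional

@dataclass
class LanguagePolicy:
    daily_record_limit: Optional[int] = None    # None = no daily throttle
    max_avg_multiple: Optional[float] = None    # throttle at N x language average
    max_rejected_ratio: Optional[float] = None  # block probable spammers
    recording_enabled: bool = True              # board can pause recording

# Example: a board throttles heavy recorders but keeps recording open.
POLICIES = {
    "tr": LanguagePolicy(daily_record_limit=100, max_avg_multiple=5.0,
                         max_rejected_ratio=0.3),
}
```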
But for any of this to work:
Currently, for Turkish, we could do the first item. The others can be done in time, but the last one is not possible, AFAIK…
For me and Turkish, the status is as follows:
So, this is not a good, healthy dataset…
In a perfect world, which measures would be needed (all with a time scale and non-identifiable) for health analysis AND performance? Some could be:
There might be many more from the volunteer-management field…
Edit (21.09.2021) - Additions to possible measures:
Agreed on this. If people cannot meet and learn with other contributors (online/offline) within the current site structure and form a community, “community health” barely matters. How can we improve overall community health if we don’t have ways to connect?
Until now, many local Common Voice communities have formed around a few specific core contributors. It’s hard for us to meet and greet new participants, provide help, and answer questions. I only know the contributors around me and am unfamiliar with maybe 99% of the participants in our locale; some of them may have questions or would like to participate more. If there were a way for us to connect with more participants, we would have a chance to onboard them from casual to core.
I keep hoping that we can have something similar to Wikipedia’s links to local community channels (at the top of every Wikipedia article!) to help bring participants closer together.
That’s the reason up to three people are involved in the validation process.
@daniel.abzakh @heyhillary
@bozden
@ftyers
Imho:
But here are some proposals to discuss:
PS: Isn’t limiting the core function of CV a little bit sub-optimal?
As a developer working on a low-resource language, getting enough sentences that are checked and cleaned up, not to mention contacting authors to release their material into the public domain, is a big task. Around 1.8 million sentences are needed for 2,000 hours; if the dataset is diverse enough, that could do it. (A rough back-of-the-envelope check follows.)
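For anyone wondering where the 1.8 million figure comes from, it is consistent with an average clip length of about four seconds, which is my assumption, not something stated in the post:

$$
2000\ \text{h} \times 3600\ \tfrac{\text{s}}{\text{h}} = 7{,}200{,}000\ \text{s},
\qquad
\frac{7{,}200{,}000\ \text{s}}{4\ \text{s/clip}} \approx 1.8\ \text{million clips}.
$$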
You can check out our team’s work in these links 1, 2.
Allowing contributors to record thousands of sentences is counterproductive. It is more productive to guide them to invite others to contribute.
I’m not sure what you mean by that but you can submit large batches of sentences via pull requests, here is an example.
When someone starts a new language section, the limit would be the 5,000 sentences needed to open that section to the public. After these sentences have been recorded (and no new ones added), the contributor gets a message: no further recording is possible. So that would be the actual limit at the moment.
If the error rate is higher than 5% (randomly checked), the batch is declined. @mkohler knows this process better than me. (A rough sketch of such a sampling check is below.)
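A minimal sketch of a random-sample check like the one described; the sample size and the review callback are assumptions on my part, not the actual Sentence Collector implementation:

```python
import random
from typing import Callable, List

def review_batch(sentences: List[str], is_faulty: Callable[[str], bool],
                 sample_size: int = 100) -> bool:
    """Accept the batch only if the estimated error rate is at most 5%."""
    sample = random.sample(sentences, min(sample_size, len(sentences)))
    errors = sum(1 for s in sample if is_faulty(s))
    return errors / len(sample) <= 0.05
```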
So to speak, if you want to keep the train running, a constant flow of new sentences to record is needed, especially when media/social-media campaigns are coming up.
It is always good to get new contributors for CV
I just wondered why someone would want to limit the biggest resource the CV project has (contributors recording clips) for the sake of diversity, slowing down the main goal of getting 10,000+ hours of validated clips.
Surely more data is better. If we had 10,000 hours of data from a single white upper middle class male, and that data can be used to train a model that sort of works for a privacy-focused TTS if you speak deep and sound posh, then that’s better than not having private TTS at all.
Oh come on, no one posted anything about one person recording 10,000 hours. That was clearly not the message before.
Yeah, I was taking an extreme; even if it were that extreme, it would still be preferable to not having enough data. If argumentum ad absurdum can’t even make a case for diversity over data, then it’s folly at best, sabotage at worst.
IMHO, the second important aspect of Common Voice is recording dying languages. There are about 7,000 languages and about 2,500 are in danger. And in the digital era, only 5% are represented on the WWW.
Most of these (very local, some tribal) languages are spoken only by a few elderly people, who themselves are nearing their EOL.
I’d say, like we do in Oral History interviews for museums and language research, let’s forget about diversity and ML in these cases and record them…
Thank you for reminding me…
I do not think that this is the main goal, and thinking that way is very reductionist and problematic. We can think of the main goal (IMO) as getting enough appropriate data for a robust speech-recognition system. That cannot be defined only in the number of hours or the number of clips.