Question about language ownership and data quality blame

I originally posted this question in the Matrix chat and I am posting here for more discussion: does Mozilla keep in contact with the admins of each language? That is, the person who originally proposed adding a language, and the ones who are ultimately supposed to be responsible for the data quality assurance and website translations of that language?

I’m asking this question because, if we find a CV split of a language that is full of garbage, who should take the blame? Or is there simply no ownership, and we don’t care even if it’s full of garbage data?

Jessica replied in the Matrix chat that we will have some form of language community manager program to address data quality issues, which I think is a good start. There are 129 languages in CV, and it’s impossible for a single person or group to keep track of every language, so there has to be some form of delegated ownership. What I have been envisioning is that we turn the community manager program into a “language owner group”, where all data quality and language planning issues can be redirected to the owners of that language. These language owners would have the power to approve translations of the website or clean up bad data.

This also brings up another question: when can we support bulk removal of bad data, including text and recordings? By removal I don’t mean deletion, since I know about the unique hash ID constraints, but blacklisting sentences and making sure they don’t get into the released set is definitely feasible. It has been a while since we first pointed out data quality issues in many languages. These issues keep accumulating and will eventually make the datasets unusable.

Thank you for raising these issues and concerns. As @jesslynnrose mentioned in the Matrix channel, we’re working on reinstating the language reps program, where each language will have its own representative. I’ll also discuss the “language owner group” idea with our team, product manager, and director to get their feedback. Regarding the bulk removal of bad data, I’ll bring this up with the engineering team. Have you filed this as a feature request on GitHub?


Thanks Gina! No, this is not a feature request, but more of a discussion. I want to hear what the Common Voice leadership thinks about this. Is it required to file an issue on GitHub?

So, if this is seen as a discussion, let me lay out some ideas and points from our past experience:

  • I’m thrilled to hear that some sort of Language Representative program will be reinstated. But it should not exist in name only; it should actually be executed. That might include:
    • Regular meetings (every 1-2 months throughout the year, probably two sessions each to cover all time zones; more than two sessions sometimes left us with very few participants)
    • The larger team should also join some of these meetings, e.g. to discuss problems and/or ideas around linguistics, the webapp, datasets, etc. I think two per year would be fine, e.g. one to discuss plans for the next year and a second for alpha/beta feedback.
    • Meetings should be recorded and viewable by LRs who could not join. One would also need shared meeting minutes, agendas, ideas, and decisions. We’ve been doing these through shared Google Documents (sometimes used like a shared whiteboard), later edited by the moderator, which worked fine.
  • CV is for dataset creation, mainly for ASR. Some lead contributors might work in the ASR area, some might be linguists, but especially for minority languages, people are here to revitalize their language, so their knowledge in these areas might be limited. They have a great strength though: being “the” person who knows that language natively. So it would be a good idea to invite (even push) as many people as possible to become LRs, for all released and upcoming languages.
  • Form a Matrix group for LRs for continued discussion. Meetings should stay on topic (with a pre-written agenda).
  • But LRs are not enough: languages need a lead team (like a language board) that can run campaigns to reach more volunteers, create sub-projects for data quality, etc.
  • Among the LRs, sub-groups can be formed, e.g. volunteers to help newcomers in one-to-one sessions, presentations of case studies, Q&A sessions, etc. The emphasis here: communities (and LRs) should be the locomotives for these kinds of actions, with MCV as the facilitator.
  • The main problem CV has is dedication. Many languages have been proposed with zero progress, and many people join for their own projects and move away after some time. So there should be replacement LRs, or better, the teams proposed above.

Some views about the original post:

the ones who are ultimately supposed to be responsible for the data quality assurance and website translations of that language?

We might not be able to find such people, and as stated above, they might disappear.

  • Pontoon is another environment. Except for minority languages, most languages have volunteer professional translators dealing with multiple projects, with the main focus on Firefox… But with the new rules, these people can easily translate the important parts and move on, of course.
  • Data quality assurance is tough - you simply cannot “assure” it. It would need a team effort, if it is possible at all, and current workflows do not help.
    • First of all, how is quality defined? E.g., are single-word sentences bad for all uses? Would one “owner” try to create a clean dataset for their own purposes? As CV datasets are general purpose, one cannot set hard rules for these, only give some guidelines (and CV actually does give some).
    • Data keeps flowing in, so such a person/team would need to check every added sentence and recording, at least to give a vote. Even if that person is an expert or an LR, they have one vote. And sometimes those decisions are themselves tough (e.g. whether a sentence should be accepted, or whether a pronunciation would cause problems).
    • One can only see the whole picture after a release comes out, at which point it is too late. Whatever is possible can only be done for the next release.
    • Suppose a medium-sized dataset has 100k recordings in validated. How would you “re-assure” the quality?
      • By listening to all of them? A nearly impossible task except for very small datasets. Is a single vote enough? They have already been voted on; why would we need another vote? Would it be a higher-weight vote, or would we need another 2-3 votes?
      • By running an ASR model on them and checking the ones with high WER values (see the sketch after this list)? But this needs a model to be trained/fine-tuned, which is impossible for low-resource languages. Where it is possible, fine. But as stated above, not everyone can do this, so not all LRs can assure anything.
      • A good example of such a process is the Artie Bias Corpus (paper). See section 3.2 for the process used for such a small set, then think of the ~25M recordings in CV datasets.
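
To illustrate the WER-based triage idea from the second bullet above: here is a minimal sketch, assuming a `transcribe()` function backed by some ASR model for the language (a hypothetical placeholder here) and the `validated.tsv` from a release. It uses the `jiwer` package for WER, and the threshold is an arbitrary starting point. Anything it flags would still need human review, not automatic removal.

```python
# Sketch: triage validated clips by comparing an ASR transcription against
# the expected sentence. `transcribe()` is a hypothetical placeholder for
# whatever ASR model exists (or can be fine-tuned) for the language.
import csv

from jiwer import wer  # pip install jiwer

WER_THRESHOLD = 0.5  # arbitrary starting point; tune per language/model


def transcribe(clip_path: str) -> str:
    """Placeholder for a real ASR call; not implemented here."""
    raise NotImplementedError


def flag_suspect_clips(validated_tsv: str) -> list[dict]:
    """Return rows whose recording disagrees strongly with its sentence."""
    suspects = []
    with open(validated_tsv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            hypothesis = transcribe(row["path"])
            score = wer(row["sentence"].lower(), hypothesis.lower())
            if score > WER_THRESHOLD:
                suspects.append({**row, "wer": score})
    return suspects  # candidates for human review, not automatic removal
```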

who should take the blame?

Nobody should, as stated in the points above. Plus, this is a crowd-sourced, volunteer-based project, where Mozilla Common Voice is the facilitator for our languages. Can we blame the greater language community (all volunteers) for not taking care of their language dataset?

“language owner group”

As I stated on Matrix: “ownership” is a strong word; I’d use “caretaker” or “curator” (I’m also a museologist).

And in my opinion, the best way is already here, minus the caretakers and minus MCV’s current pace: the caretakers should work with the team on major decisions (e.g. bulk deletions, merging of languages, etc.). But both sides should be open to changes and ready to make compromises; perhaps another solution will turn out to be better.

when can we support bulk-removal of bad data
blacklisting sentences and making sure that they don’t get into the released set

I’m with you on this. Perhaps something like the bulk-sentence-addition process could help, with removal requests submitted in one of these forms:

  • sentence_id - request_reason
  • path - request_reason
  • client_id - request_reason
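
To make those request forms concrete: below is a minimal sketch of applying such blacklists while assembling a release, so blacklisted items never reach the exported set. The blacklist TSVs (an id column plus a `request_reason`, mirroring the bullets above) are an assumed format, not an existing Common Voice file, and the function names are mine; the clip metadata columns (`client_id`, `path`, `sentence`) follow the released .tsv layout.

```python
# Sketch: exclude blacklisted sentences, clips, and contributors while
# building a release. Blacklist file format is assumed (id + request_reason),
# mirroring the request forms listed above.
import csv


def load_blacklist(path: str, id_column: str) -> set[str]:
    """Read one id column (sentence_id, path, or client_id) from a TSV."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row[id_column] for row in csv.DictReader(f, delimiter="\t")}


def filter_release(in_tsv: str, out_tsv: str, bad_sentences: set[str],
                   bad_clips: set[str], bad_clients: set[str]) -> None:
    """Copy clip rows to the release file, skipping anything blacklisted."""
    with open(in_tsv, newline="", encoding="utf-8") as fin, \
         open(out_tsv, "w", newline="", encoding="utf-8") as fout:
        reader = csv.DictReader(fin, delimiter="\t")
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames,
                                delimiter="\t")
        writer.writeheader()
        for row in reader:
            if (row.get("sentence_id") in bad_sentences
                    or row["path"] in bad_clips
                    or row["client_id"] in bad_clients):
                continue  # blacklisted: keep out of the released set
            writer.writerow(row)
```

The `request_reason` column isn’t needed for the filtering itself, but keeping it in the blacklist preserves an audit trail of why each item was excluded, which matters if these decisions are delegated to caretakers.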

My 2c…

Edit: Language, punctuation etc.


@bozden Strongly agree with all your points - well written.
