So, if this is meant as a discussion, let me put forward some ideas and points from our past experience:
- I’m thrilled to hear that some sort of Language Representative program will be reinstated. But it should not be left as just a name; it should actually be executed. That might include:
- Regular meetings (every 1-2 months throughout the year, probably two sessions each to cover all timezones; more than two sessions sometimes left us with very few participants)
- For some of these meetings, the larger team should also join, e.g. to discuss problems and/or ideas in linguistics, the webapp, datasets, etc. I think two per year would be fine, e.g. one to present the plans for the next year and a second for alpha/beta feedback.
- Meetings should be recorded and viewable by other LRs who could not join. One would also need shared meeting minutes, an agenda, the ideas shared, decisions taken, etc. We have been doing these through shared Google Documents (sometimes used like a shared whiteboard), later edited by the moderator, which worked fine.
- CV is for dataset creation, mainly for ASR. Some lead contributors might be working in the ASR area, some might be linguists, but especially for minority languages, people are here to revitalize their language, so their “knowledge” in these areas might be limited. They have a great strength though: being “the” person who knows that language natively. So inviting as many people as possible (even pushing them) to become LRs would be a good idea, for all released and upcoming languages.
- Form a Matrix group for LRs for continued discussion. Meetings should stay more on topic (with a pre-written agenda).
- But LRs are not enough: languages need a lead team (like a language board) that runs campaigns to reach a wider pool of volunteers, creates sub-projects for data quality, etc.
- Among the LRs, sub-groups can be formed, e.g. volunteers who help newcomers in 1-to-1 sessions, present case studies, run Q&A sessions, etc. The emphasis here: communities (and LRs) should be the locomotives for these kinds of actions, with MCV as the facilitator.
- The main problem CV has is dedication. Many languages have been proposed with zero progress afterwards. Also, many people join for their own projects and move away after some time. So there should be replacement LRs, or better, teams as proposed above.
Some views about the original post:
> the one who are ultimately supposed to be responsible for the data quality assurance/website translations of that language?
We might not be able to find such people, and as stated above, they might disappear.
- Pontoon is another environment. Except for minority languages, many languages have volunteer professional translators dealing with multiple projects, with the main focus on Firefox… But with the new rules, these people can easily translate the important parts and move on, of course.
- Data quality assurance is tough; you just cannot “assure” it. It would need a team effort, if it is even possible, and the current workflows do not help with this.
- First of all, how is quality defined? E.g. are single-word sentences bad for all uses? Would one “owner” try to create a clean dataset for their own purposes? As CV datasets are general purpose, one cannot set hard rules for these, only give some guidelines (and CV actually does give some).
- Data will keep flowing in, so such a person/team would have to check every added sentence and recording, at least to cast a vote. Even if that person is an expert or an LR, they have one vote. And sometimes those decisions themselves are tough (e.g. should this sentence be accepted, or would this pronunciation cause problems?).
- One can only see the whole picture after a release comes out, at which point it is too late. If anything is possible at all, it can only be done for the next release.
- Suppose a medium-sized dataset has 100k recordings in validated. How would you “re-assure” their quality?
- By listening to all of them? A nearly impossible task except for very small datasets (100k clips at ~5 seconds each is roughly 140 hours of audio). Is a single vote enough? They are already voted on; why would we need another vote? Would it be a higher-value vote, or would we need another 2-3 votes?
- By running ASR on them and checking the ones with high WER values (see the rough sketch after this list)? But this would need a model to be trained/fine-tuned, which is impossible for low-resourced languages. Where it is possible, fine. But as stated above, not all people can do this, so not all LRs can assure anything.
- A good example of such a process is the Artie Bias Corpus (paper). See section 3.2 for the process used on that small set, and then think of the ~25M recordings in the CV datasets.
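Just to show how much machinery even the “run ASR and check WER” route needs, here is a rough sketch, assuming a Common Voice style `validated.tsv` with `path` and `sentence` columns, an already fine-tuned model for the locale (the model id and the 0.5 WER threshold below are made-up placeholders), and the `transformers` and `jiwer` libraries:

```python
# Rough sketch: screen validated clips with an existing ASR model and flag
# high-WER ones for human re-review. Model id and threshold are placeholders;
# decoding mp3 clips also requires ffmpeg on the system.
import csv
import jiwer                         # pip install jiwer
from transformers import pipeline    # pip install transformers

asr = pipeline("automatic-speech-recognition",
               model="your-org/asr-model-for-your-locale")  # hypothetical model

flagged = []
with open("validated.tsv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        hypothesis = asr("clips/" + row["path"])["text"]
        error_rate = jiwer.wer(row["sentence"].lower(), hypothesis.lower())
        if error_rate > 0.5:         # arbitrary cut-off, needs tuning per language
            flagged.append((row["path"], error_rate))

print(f"{len(flagged)} clips flagged for manual re-listening")
```

Even then, the flagged clips still have to be re-listened to by humans; the script only shrinks the pile, and it presupposes a model that most low-resourced languages simply do not have.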
> who should take the blame?
Nobody should, for the reasons given above. Plus, this is a crowd-sourced, volunteer-based project where Mozilla Common Voice is the facilitator for our languages. Can we blame the greater language community (all volunteers) for not taking care of their language dataset?
> “language owner group”
As I stated on Matrix: “ownership” is a strong word; I’d use “caretaker” or “curator” (I’m also a museologist).
And in my opinion, the best way is already here, minus the caretakers and minus the speed of MCV: the caretakers should work with the team on major decisions (e.g. bulk deletions, merging of languages, etc.). But both sides should be open to changes and ready to make compromises; perhaps another solution will turn out to be better.
> when can we support bulk-removal of bad data

> blacklisting sentences and making sure that they don’t get into the released set
I’m with you on this. Perhaps something like the bulk sentence addition workflow could help, e.g. a request file listing one of the following per row (a rough sketch follows the list):
- sentence_id + request_reason
- path + request_reason
- client_id + request_reason
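To make this concrete, here is a minimal sketch of how such a request file could be applied before a release. The `removal_requests.tsv` name, its `id_type`/`id`/`request_reason` columns, and the exact `validated.tsv` column names are my assumptions for illustration; nothing like this exists in CV today:

```python
# Minimal sketch (assumed format): apply a hypothetical removal_requests.tsv
# with columns id_type / id / request_reason to a Common Voice style
# validated.tsv before building a release.
import csv

removals = {"sentence_id": set(), "path": set(), "client_id": set()}
with open("removal_requests.tsv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        removals[row["id_type"]].add(row["id"])

with open("validated.tsv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f, delimiter="\t")
    rows = list(reader)
    fields = reader.fieldnames

kept = [
    row for row in rows
    if row.get("sentence_id") not in removals["sentence_id"]
    and row.get("path") not in removals["path"]
    and row.get("client_id") not in removals["client_id"]
]

with open("validated.filtered.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fields, delimiter="\t")
    writer.writeheader()
    writer.writerows(kept)  # blacklisted rows never reach the released set
```

The same kind of filter could be run on the sentence side, so that blacklisted sentences do not come back in future releases.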
My 2c…
Edit: Language, punctuation etc.