Difficulties using the sentence collection tool to import large amounts of sentences

My previous process for collecting sentences during a sprint event:

  1. Host a sentence collection sprint, asking the participants to find and collect CC0 sentence material into a pad / Google Doc / text editor, and leave their doc URL in the event co-note.

  2. At the end of the event, we check each sentence doc one by one together, discuss, and give suggestions: is this CC0-compatible? Are the sentences overall too long / too short? Is there anything to fix, e.g., numbers that need to be converted into words…

  3. When participants feel their sentences are good, they send me the link.

  4. I go through all the sentences one by one in a text editor and remove or modify the inappropriate ones, e.g., removing non-everyday or old-fashioned dialog, changing sensitive personal data like names to fake names, breaking overly long sentences in two, or combining overly short ones. At this step, I modify about 60-70% of them.

  5. After all sentences have been reviewed, I shuffle them for further de-identification and de-duplicate them (a minimal script sketch follows this list).

  6. Submit them through a PR (several days after the event).
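
For step 5, here is a minimal sketch of what that shuffle / de-duplication pass could look like as a script. It assumes one sentence per line; the file names `collected.txt` and `cleaned.txt` are placeholders, not part of any actual workflow:

```python
import random

# Read the collected sentences, one per line (file name is a placeholder).
with open("collected.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

# De-duplicate, keeping a single copy of each sentence (order preserved).
unique = list(dict.fromkeys(sentences))

# Shuffle so sentences no longer appear in their original, potentially
# identifying, order.
random.shuffle(unique)

with open("cleaned.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(unique) + "\n")
```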

The problems I face with the sentence collection tool:

  1. If I ask people to submit sentences directly to the tool, we are unable to review and discuss them together. It's hard for them to fix sentences that need improvement: they cannot withdraw, batch-edit, and re-submit.

  2. I can only click approve or reject; I cannot fix sentences. When I review sentences in a text editor and see anything bad, I can fix it directly, but that's not possible in the collection tool. In my previous experience, more than 50% of the draft sentences fall into these cases.

  3. The review UI is far too slow (> 10x slower) compared to a text editor (just move down the file and delete bad lines; see the pre-filter sketch after this list) when we have a very large number of sentences. (I currently have more than 5,000 from our last sprint to review.)

  4. If people submit a whole batch of bad sentences (copyrighted material, overly old novels, inappropriate material), we need to reject them with tens of thousands of clicks. We cannot jump directly to the good part.

  5. If I submit the sentences after a manual pre-review, to make sure they are good and avoid the problem above, I then need to click "approve" a thousand times for sentences I have already checked. After that, I need to ask other people to do the same so the sentences are approved and go online. It's triple the effort.
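
A rough pre-filter could cut down the manual clicking by flagging the obviously problematic lines before they ever reach the tool. This is only a sketch under assumed rules: the length limits and the digit check below are illustrative, not the collection tool's actual validation criteria:

```python
import re
import sys

MIN_CHARS, MAX_CHARS = 5, 100  # illustrative limits, not the tool's real rules

def problems(sentence):
    """Return the reasons a sentence needs manual attention."""
    reasons = []
    if len(sentence) < MIN_CHARS:
        reasons.append("too short")
    if len(sentence) > MAX_CHARS:
        reasons.append("too long")
    if re.search(r"\d", sentence):
        reasons.append("contains digits (convert to words)")
    return reasons

# Read sentences from stdin and print only the flagged ones with the reasons,
# so a reviewer can jump straight to the lines that need fixing.
for line in sys.stdin:
    sentence = line.strip()
    if sentence:
        found = problems(sentence)
        if found:
            print(sentence + "\t# " + ", ".join(found))
```

Run as, e.g., `python prefilter.py < collected.txt` (the script name is hypothetical) to list just the lines that need attention.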

Biggest problem - Time cost

For a small community like ours (there are dozens of people who "watch" but mostly one or two people who "work"), it's a process with a lot of hassle. With the same amount of time spent on the approval process, I could review and collect three times as many sentences.

Time is the most precious resource for us volunteers, and I believe Common Voice communities are mostly small in every country. If the tool cannot save time, it would be really bad to ask volunteers to switch processes.

If we force people to use a much slower process, it's far less likely that we reach our volume goal.

The scenario for the current tool

The current sentence collection tool makes sense for collecting random sentences from individual users, but it is not suitable for batch-importing known-good sentences from known contributors. It's really not a good fit.

Thanks for sharing, really informative.

As we discussed on Telegram, we know the sentence collector is not perfect, but it's the best tool we have right now to handle all communities with a high degree of confidence in the submitted sentences (the previous process required a lot of staff time to review PRs, created uncertainty about whom to trust, and let a lot of wrong sentences be included).

Having said that, I acknowledge the current issues you describe in this post:

  • A high percentage of people submitting wrong sentences: Note that the current tool flags which of the sentences you are submitting are wrong and why; letting people self-correct can probably be faster than waiting for others to tell them or fix it.
  • People submitting copyrighted material: Probably better education before the event? It's really difficult to control this, especially for random contributions; sometimes we will never know unless there is an API to check for copyrighted material.
  • Slow UI for review: We want to balance speed and proper review. There are a few suggestions out there; we are open to incorporating some of them, and the UX team is already thinking about that.

I honestly think that having one sole reviewer is not a good idea and can't scale. It could work now for a few thousand sentences, but in general it will decrease quality (fewer eyes) and burn out contributors. We will need at least 1.8M unique sentences for 2,000 hours, and this can't be done by one or two core reviewers.
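
(Back-of-the-envelope on that figure: 2,000 hours is about 7.2M seconds of audio; if each sentence is recorded once at roughly 4 seconds per clip, that works out to about 1.8M unique sentences. The 4-second average is my assumption to make the numbers meet, not an official figure.)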

/cc-ing @mkohler here as well to get his thoughts.

But the reality is, in most Mozilla projects just 20% of the contributors make 80% of the contributions, and at our local community sizes, I believe that's one or two people in most countries. We could do a survey to find out.

To be honest, there is no way Common Voice communities can work that way; we should build systems that get a lot of people involved. Doing the math, no locale will be able to get to 2,000 hours if the community model is what we are used to in Mozilla communities.

It would be good if we could reject all remaining sentences from one submitter; that could solve both problems.

Yeah, we started talking about that here:

I don't think the problem is the system. The problem is the size of the contributor base.

Local open source communities only have so many contributors, and naturally only a fraction of them are dedicated to Mozilla projects. Over the years new people come and old people leave; we're not even close to the numbers we had during the Firefox 3 release and the Firefox OS era, which was our period of most rapid growth. I have no idea how we can do it for Common Voice.

Maybe we need to think about a strategy for growing to thousands of contributors before preparing the tools for that scale.

Also, I don't believe every language needs that much data. It's "better than none" for some languages, and we should also ease the effort for those contributors.

That's currently what I'm working on, especially because the way I see it, Common Voice communities are a combination of individual contributors, local communities, crowdsourcing, organizations, and partners. It's not what we are used to at Mozilla, for sure.

I'm more worried about contributors burning out because of the policies and goals, not because of the workload.

As an example, contributors didn't burn out because the Firefox OS release schedule was so tight and the work so difficult; they burned out because of the direction, and because we made it hard for them by discarding their effort all at once.

By the time we have 100x the contributions, we can easily scale the system with more core contributors (at the same percentage, we will also have 100x more core contributors) and work with tools built for that scale.

But for now we don't have those numbers, and we shouldn't make things harder for the current people with a system designed for 100x more people.

Take now as an example: if any of us really needs to click the approve button 3,000 times to get through, we will burn out right now, not wait until we grow to 100x the size.

I have a strong opinion here: I think we will never have 100x (or 1000x) contributors if we keep using the same processes and tools we have always used, which are not welcoming and don't enable big crowds of contributors to get involved.

We probably disagree here, but my experience tells me the approach we are taking with Common Voice is the right one to get these numbers. The good thing is that we are experimenting and testing approaches fast, and we can discard the ones that don't work. That's why I think we should all be open to trying new things and seeing what happens, and why I'm pushing you and other contributors to be more comfortable and flexible with our new approaches :wink:

Cheers.

I do like crowdsourcing, and I very much welcome new participants. That's why we talk to people from so many different backgrounds in different places, created a community channel that is decentralized and free to join and discuss in, and tried to onboard them into contributing more than just recording and downloading. And that's why I would like this to happen, to better welcome more new volunteers.

To be honest, I have always wanted to find some volunteers with a linguistics background to do this work, so I can pivot to another new domain (we have too many projects to take care of!).

But we are not there yet, and we still have the "how to import a large amount of sentences" problem to solve.

(I'm thinking of pausing here and waiting for other people's comments.)

Editing sentences during review is in the pipeline; no idea when we will get to it, though. There are also some edge cases we will probably need to discuss when the time comes. I wonder about the 50%, though; that seems waaay too high to me. Can you elaborate on what types of fixes these need?

That’s the whole point of the review process. We don’t want one single person to decide whether the sentences should be included or not. Sure, the UX around that could be improved, but the principle will stay the same.

We have around 3k sentences waiting for validation now…