Sentence & Clip Reporting Strategy


We’ve had the sentence & clip reporting feature up for about a month now. So far the reports just go into our database, with no automatic action being taken. Attached to this post you’ll find all the reports we’ve received so far (Content Warning for the reported clips with the “offensive-speech” tag):

Some ideas that were already floated with regards to what we should do based on the reports:

  • Auto-downvote clips that people report (only for particular reasons given?)
  • Disable recorded sentences and mark their respective clips as invalid

There might be more or different things we can do here:

  • How do you think we should deal with sentences and clips with reports?
  • Should we handle each report category differently? How?

Thank you!


I looked at a few of the clips, and it seems people are flagging them for things like bad quality, which is what the No button is for. So clearly there is confusion over when to use No and when to use Report.

Another thing is that choosing the “Other” option seems to attach the report to the clip, even when the complaint is about the sentence. There’s no way to specify which one you mean.

In terms of what actions should be taken:

  1. Flagging up a clip should count as a No vote for that user.

  2. We’ve had users record 50 or 100 clips in a row of racist abuse. There needs to be some way of getting those other clips out of rotation quickly once a certain number have been flagged. Maybe if 3 different users flag 3 different clips by that user in a certain space of time it puts that user’s remaining clips in quarantine? It could be limited to apply only to users with new accounts or with no approved recordings.

  3. Five votes for the same reason on a sentence removes it from rotation. Eventually there should be some way of viewing this in Sentence Collector and submitting a corrected one if necessary.
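To make points 2 and 3 concrete, here is a rough sketch of how the two thresholds might be checked. Everything in it is an assumption for illustration: the names (`ClipReport`, `speakers_to_quarantine`, `sentences_to_remove`), the one-day window (the proposal only says "a certain space of time"), and the reading of "3 different users flag 3 different clips" as at least 3 distinct reporters and at least 3 distinct clips. It does not reflect how Common Voice stores reports, and the "new accounts only" restriction is left out.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical thresholds taken from the proposal above.
MIN_REPORTERS = 3              # distinct users who flagged the speaker
MIN_CLIPS = 3                  # distinct flagged clips by that speaker
WINDOW_SECONDS = 24 * 60 * 60  # assumed: the proposal leaves the window open
SENTENCE_REPORT_LIMIT = 5      # same-reason reports needed to pull a sentence


@dataclass(frozen=True)
class ClipReport:
    reporter_id: str
    clip_id: str
    speaker_id: str
    timestamp: float  # seconds since the epoch


def speakers_to_quarantine(reports, now):
    """Return speakers whose recent reports meet both thresholds."""
    reporters = defaultdict(set)
    clips = defaultdict(set)
    for r in reports:
        # Only count reports inside the trailing time window.
        if now - r.timestamp <= WINDOW_SECONDS:
            reporters[r.speaker_id].add(r.reporter_id)
            clips[r.speaker_id].add(r.clip_id)
    return {
        speaker
        for speaker in clips
        if len(clips[speaker]) >= MIN_CLIPS
        and len(reporters[speaker]) >= MIN_REPORTERS
    }


def sentences_to_remove(sentence_reports):
    """sentence_reports: iterable of (sentence_id, reason, reporter_id)."""
    voters = defaultdict(set)
    for sentence_id, reason, reporter_id in sentence_reports:
        # A set per (sentence, reason) so repeat reports by one user count once.
        voters[(sentence_id, reason)].add(reporter_id)
    return {
        sentence_id
        for (sentence_id, _reason), users in voters.items()
        if len(users) >= SENTENCE_REPORT_LIMIT
    }
```

Counting distinct reporters rather than raw reports is a deliberate choice here, so that one person cannot trigger a quarantine or a sentence removal alone.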


I’d like to suggest adding one or more admins to manage the feedback. Sometimes people make good recordings of misspelled sentences, so we could correct the sentences rather than lose the recordings.


@gregor, is the TSV file UTF-8 encoded?

That’s correct! :slightly_smiling_face:
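For anyone loading the reports file in Python, a minimal sketch of reading UTF-8 tab-separated data follows. The column names (`sentence`, `reason`) and the sample row are made up for the example, since the real file's layout isn't shown in this thread, and `quoting=csv.QUOTE_NONE` assumes the file does not wrap fields in quotes.

```python
import csv
import io


def read_reports(binary_stream):
    """Decode a UTF-8 TSV stream and return its rows as dictionaries."""
    text = io.TextIOWrapper(binary_stream, encoding="utf-8", newline="")
    # QUOTE_NONE treats quote characters inside sentences as ordinary text.
    reader = csv.DictReader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


# Hypothetical sample standing in for the attached file.
sample = "sentence\treason\nIsso não está certo\tgrammatical-error\n".encode("utf-8")
rows = read_reports(io.BytesIO(sample))
print(rows[0]["sentence"], "|", rows[0]["reason"])
```

With a file on disk, passing `open(path, "rb")` to `read_reports` should behave the same way.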

How would you scale your suggestion for languages where we can potentially have a lot of reports?

Is there a way to crowdsource what you are proposing so we don’t have to rely on just a few individuals (potential bottleneck)?

@nukeador,
I would suggest one or more admins for each locale. That’s what I’m asking for. I don’t know how other locales are organised. For Kabyle, we have a significant number of graduates from language departments who can help with corrections. I’m planning a training session for some of them (they are not technical) to help them with GitHub, since there is no suitable UI for making corrections.
NB: Some reported errors actually aren’t errors.

@belkacem77 I think that’s a great idea. I’m seeing some errors in the Portuguese corpus and I report the sentences, but nothing happens. If I had access to the repo, I could fix them and commit the changes.

@Codigo_Logo_Programacao_e_Inteligencia_Artificial, you can correct them on GitHub, but it’s not a suitable UI for people who are not comfortable with such tools.

@belkacem77 I’m glad to hear this. I’ll take some time this week to improve our repo; it will certainly be a great step towards our goal.

What I’m currently thinking about is how we can implement something that scales and that can be managed by the community at large.

It would work the same way we validate clips, so we have more eyes and hands helping rather than just a few individuals.
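One way to picture a crowd-managed review is a simple agreement rule, where several contributors vote on each report and it is resolved once enough of them agree. The sketch below is only an illustration: the function name `resolve_report`, the vote labels, and the threshold of 2 are assumptions, not a description of how clip validation or reporting works today.

```python
from collections import Counter

AGREEMENT_THRESHOLD = 2  # hypothetical number of matching votes needed


def resolve_report(votes):
    """Resolve a report from 'uphold'/'dismiss' votes, in the order cast.

    Returns 'upheld', 'dismissed', or 'pending' if neither side has
    reached the threshold yet.
    """
    counts = Counter()
    for vote in votes:
        if vote not in ("uphold", "dismiss"):
            raise ValueError(f"unknown vote: {vote!r}")
        counts[vote] += 1
        # The first side to reach the threshold decides the outcome.
        if counts[vote] >= AGREEMENT_THRESHOLD:
            return "upheld" if vote == "uphold" else "dismissed"
    return "pending"
```

With a rule like this, no single reviewer decides the outcome, and the load is spread across whoever is active in a given language.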


Has a policy for this been determined yet? Because now that a lot of the harder foreign words have been filtered out, I’m finding the most common reason I flag a sentence is because of poor grammar. Some examples:

In those days the councillors was called commissioners.

With these schools they creates the Easton Valley Community School District.

The storm caused severe flooding states such as New Jersey, New York and Pennsylvania.

The is the only line.

I can’t really think of a good way of automatically filtering these out, so I think the reporting function is really the only way. Of course, this requires someone to actually see it (which is a bad user experience) and flag it, by which time it’s probably already been recorded, but I really can’t think of a way to do it accurately without human review.

Thanks for pinging back on this.

Since we have been moving our main dev around we haven’t been able to properly prioritize this feedback.

@mbranson is this something we can make sure we have in the backlog so it is properly triaged when we have time?


Just realized I didn’t respond here, apologies! Yes, this is captured in our backlog and we’re working to prioritize this among various other needs / requests as part of 2020 planning. In the immediate future the Common Voice team focus is on infrastructure improvements, database optimization and releasing the latest dataset. In the meantime please keep providing feedback and proposals here, our goal is to incorporate these learnings into the reporting feature improvements.