Feedback on how we collect and validate sentences

When we review the sentences, what is our process for ensuring they are really CC0 with the current tool?

There is a source column on the submit page, but how, and by whom, will the sources be checked before the sentences go live on the CV site?

When I have dealt with donated sentences before, the most important work has been rewriting most of them to make them “more neutral”. A purely accept/reject process lacks the “fix the sentence” function that core contributors would need.

If the sentences won’t go directly online and will get a second manual review, then this won’t be a problem for me.

An additional note on the auto-period issue:

We don’t have trailing periods in the current Chinese datasets (zh-TW / zh-HK / zh-CN), and so far I don’t think that has caused any problems.

I would really like the ability to skip the “auto period” feature on a per-locale basis.

(If the data are not going online to the CV site directly, I can remove the periods manually.)
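For illustration, the skip I have in mind could be a simple per-locale lookup before the period is appended. This is only a sketch; the locale list and function here are my assumptions, not the tool’s actual code:

```ts
// A minimal sketch of a per-locale skip for the auto-period feature.
// The locale codes, set name, and function are illustrative assumptions,
// not the sentence collector's actual configuration or code.
const SKIP_AUTO_PERIOD = new Set(['zh-TW', 'zh-HK', 'zh-CN']);

function maybeAddTrailingPeriod(sentence: string, locale: string): string {
  const trimmed = sentence.trim();
  // Leave locales that opted out untouched.
  if (SKIP_AUTO_PERIOD.has(locale)) {
    return trimmed;
  }
  // Only append a period when no terminal punctuation is present.
  return /[.!?]$/.test(trimmed) ? trimmed : `${trimmed}.`;
}
```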

Currently, the biggest question for me is how the sentence-collection process will change after the tool goes online.

Is this a tool just to help us gather sentences, where we can still manually check and modify the reviewed sentences afterwards (via a GitHub PR or some other way), or will the reviewed sentences go directly online?

If it’s the first, then many of these problems can be solved later and it won’t be a blocker for me.

If it’s the second, then I’m quite worried about how we keep data quality and ensure the sentences really are under CC0.


Take the sentence sprint we held last year as an example: although we explained the requirements for sentences very clearly face-to-face, and all participants were current Mozilla contributors, every sentence we got from the sprint still required tweaking in some way before it could go online.

Take the current sentence PRs on voice-web as a second example: although we make it very clear that sentences should come from CC0 sources, many PRs still submit sentences that are not legally compatible and need to be removed once we ask again about the data source.

If any non-CC0 sentences go online and are discovered later by a user of the data, it will be a nightmare: we will need to review all sentences one by one by googling them, and we will have broken our promise that the data is free of licensing restrictions.

I would like to prevent this beforehand rather than fix the issue afterwards.

So IMHO voting alone is really not enough to make sure sentences are good to go online.


The ideal process for me is like our Firefox L10n process: the locale owner makes the final decision and tweaks all translations before they go into the production product.

  1. The tool is there to help gather sentences from general contributors.
  2. After sentences get enough votes, they are set aside for review.
  3. Core locale contributors then make the final decision on each one: good, bad, or good after tweaking (see the sketch below).
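Sketched as the states a sentence would pass through (the names are only illustrative of this proposal, nothing implemented):

```ts
// Illustrative states for the proposed flow; the names are assumptions.
type SentenceState =
  | 'submitted'        // gathered from general contributors via the tool (step 1)
  | 'voted'            // reached the community vote threshold (step 2)
  | 'approved'         // locale owner accepted it as-is (step 3)
  | 'approved-edited'  // locale owner tweaked the wording, then accepted it
  | 'rejected';        // locale owner turned it down
```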

My expectation is that many sentences will be submitted through the tool, but eventually one or two core contributors per locale will still donate most of the sentences, so we should make sure we’re easing their workload rather than adding stress to it.

Thanks for your feedback Irvin, greatly appreciated. I’ll answer two things and leave the rest for Nuke to answer.

A sentence-fixing feature would be nice indeed. Would you mind filing a bug in the GitHub repo so we can track it for after the MVP?

I think this is just a display bug when reviewing sentences, not that they actually get saved with a period at the end. Will double check though.

Moving this to a new topic since it’s general feedback about the process and not about the tool development itself.

We want to incorporate automation here, using existing algorithms that can check the license status of a text corpus. We are in a similar situation to Wikipedia with their articles.

Sentences won’t go online directly yet (we haven’t coded that part), but that’s the idea: approved sentences (2 positive reviews) will be pushed into the Common Voice repo.
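Roughly, the approval rule would look like the sketch below; the record shape and the tie-breaking against rejections are assumptions on my side, not the implemented logic:

```ts
// Sketch of the "2 positive reviews" rule; the shape of the record and the
// handling of rejections are assumptions, not the collector's actual schema.
interface ReviewedSentence {
  sentence: string;
  locale: string;
  approvals: number;
  rejections: number;
}

function isApproved(s: ReviewedSentence): boolean {
  return s.approvals >= 2 && s.approvals > s.rejections;
}

// Approved sentences would then be batched per locale and pushed to the repo.
function selectForExport(reviews: ReviewedSentence[]): ReviewedSentence[] {
  return reviews.filter(isApproved);
}
```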

From your message I see you identify two issues:

  • How to ensure public domain.
  • How to fix sentences.

For the first one, as I said, we want to incorporate automation when possible.

For the second one, we want to allow in-site editing during review in the future, but for the MVP the only thing you can do while reviewing is reject.

About the approval process:

We want to follow the same crowdsourcing model we are using for voice collection, which has been working well. I understand your reservations about quality, and that’s definitely important.

We can adapt the process if we see it isn’t working, but I’m confident the automation we are building into the tool will allow a large group of people to handle a huge number of sentences while still ensuring quality.

Are there plans to run the existing sentences on the site through the collector for validation? There are a lot of mistakes, at least in English. Although this may not make sense until sentence editing is available.

When we release the dataset we want to ask for a clean-up, but we could probably also run some of the validation algorithms we have in the collector. @gregor, what do you think?
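To give an idea of what such a run could catch, here is a sketch of a few simple checks; the specific rules and the word limit are examples I’m assuming, not the collector’s actual validation:

```ts
// Hypothetical examples of automated checks that could flag existing sentences
// for clean-up; the rules and limits are assumptions, not the real validation.
const MAX_WORDS = 14;

function findIssues(sentence: string): string[] {
  const issues: string[] = [];
  if (sentence.trim().split(/\s+/).length > MAX_WORDS) {
    issues.push('too long to record comfortably');
  }
  if (/\d/.test(sentence)) {
    issues.push('contains digits, which readers may pronounce differently');
  }
  if (/[<>{}\[\]\\|]/.test(sentence)) {
    issues.push('contains markup or stray symbols');
  }
  return issues;
}
```

Flagged sentences could then be queued for manual review or editing rather than being dropped automatically.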