We want your feedback: Improving the sentence collection

Thanks everyone for your feedback. This is really helpful for defining the requirements to improve Common Voice sentence collection, so please keep it coming.

A quick note: I will be taking a few weeks away, so expect fewer iterations from staff until the end of August. Please keep the feedback coming in this topic in the meantime; I will check it as soon as I’m back to inform a recommendation.


  • Finding CC0 material is hard; everyone ‘liberally’ uses CC-BY.
  • Having a way to tag alternate sentences that mean the exact same thing.
    • In Norway we have so many ‘official’ ways to say things; some people only use one of them, some use the other, and very few people use both.
    • For example, the English “to be” can be either “å verta” or “å bli” in Norwegian.
    • We have a ton of this.
  • Having a built-in system for translation of English sentences.
    • I’ve done this manually for now. It’s boring, but I’m also a bit concerned that some interesting data (the connection between an English sentence and its Norwegian translation) is getting lost.
    • It also needs to have several output sentences for one input sentence.
  • When taking in new sentences, check all new words so we can verify they follow the correct grammar.
    • Sadly we even have “choose-your-own-adventure” grammar in Norwegian.
    • You have to be internally consistent, but you can choose to either write “to be” as “å vera” or “å vere”. Yes, in addition to “å bli”.
    • We would only want one of those forms in the corpus, so that the speech recognition output is in one consistent form.
    • That’s a hard problem, and I think Norwegian has it worse than most, but anyone would benefit from rules, stats and information on importing (or in review).
  • We could have a simple way for people to contribute their blogs as corpus.
    • That’s how I’ve gotten most of the sentences I’m preparing for Norwegian.
    • However, it needs rather intensive proofreading.
    • Or even other places like Facebook / Twitter.
  • Not for sentence collection, but we also need to be able to record which dialect the person identifies as speaking.
    • Norwegian dialects sound extremely different, so good Norwegian speech recognition will need a good distribution of them.
    • This is also my main interest in this project, as commercial speech recognition I’ve tried won’t understand you unless you change the way you speak.

I think it would be nice to have sentence templates with placeholders for things like cities, countries, female or male names and so on. The reason I like them is that they could lead to fewer repetitions of the same word sequences over and over again. For this to happen, though, it has to be implemented right.

In German we currently have a lot of sentences of the form “$A is the capital of $B.” or “Can you walk from $A to $B?”. The main part of these sentences always stays the same, while the variables are substituted by geo-locations. This could lead to overfitting (and might be boring to read at some point).

If Common Voice supported real templates, it would be aware of the fact that there are multiple variations of the same sentence and such a template sentence would not show up more frequently than other sentences (or at least not much more).
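As a rough sketch of how such template awareness might work (the function and slot names here are hypothetical, not part of any proposed design), a template could be expanded into only a small sample of its possible combinations, so one template does not flood the corpus:

```python
import itertools
import random

def expand_template(template, slots, sample_size=3):
    """Expand a template like '$A is the capital of $B.' with slot values,
    sampling only a few combinations so a single template does not show up
    much more often than ordinary sentences."""
    keys = list(slots)
    combos = list(itertools.product(*(slots[k] for k in keys)))
    random.shuffle(combos)
    sentences = []
    for combo in combos[:sample_size]:
        sentence = template
        for key, value in zip(keys, combo):
            sentence = sentence.replace(f"${key}", value)
        sentences.append(sentence)
    return sentences

slots = {"A": ["Berlin", "Paris", "Oslo"], "B": ["Germany", "France", "Norway"]}
print(expand_template("$A is the capital of $B.", slots, sample_size=2))
```

The key design point is the sampling step: without it, a handful of geo-templates can dominate the corpus, which is exactly the overfitting concern raised above.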

Related to sentence collection is sentence correction. We need an interface for that which takes the sentence’s bucket into account. Currently, if a sentence is in the “train” bucket and someone adds a missing comma via a GitHub pull request, the corrected sentence might land in the “test” bucket. That’s a problem.

When there is a public interface for contributing sentences, a set of detailed rules and guidelines would be helpful. The ones at https://voice-sprint.mozilla.community/contributing/ are a good start, but they leave a lot of questions open. Also, they do not contain any language specific hints. Two examples:

  • Are colloquial shortcuts or spellings allowed?
    en: “want to” -> “wanna”, “going to” -> “gonna”
    de: “heran” -> “'ran”, “nichts” -> “nix”, “deine Mutter” -> “deine Mudda” (yes, this is a common term and e.g. Google’s STT engine spells it like this)
    Keep in mind: If such alternative spellings are in the text corpus, all voice contributors have to differentiate between these, as well.
  • What about different spellings of names? Hanna vs. Hannah, Nils vs. Niels, Gustav vs. Gustaf, Jasmin vs. Yasmin and so on. Should only the most popular form be included (which is often hard to tell)?

As more people contribute to the corpus, more of these questions will arise. They have to be discussed in the community and the guidelines have to be regularly adapted.


News sites could be useful as well with constantly updating content.

Here are some Punjabi news sites:

Sites like the BBC that have articles for many different languages could be a good source as well.

Additionally, while it may not be “everyday language”, all European Union documents need to be translated into many different languages by language professionals.

This could serve as a basis for scraping many different sentences: https://europa.eu/european-union/documents-publications/official-documents_en


Hi again,

Thanks everyone for your comments. In the following weeks we will scope and define a set of requirements and features we would like to see in an MVP of the tool, and also clearly define the user journey based on all your feedback and the existing lists of requirements from other groups.

I’ll be sharing a draft as soon as it’s ready and reviewed.


Quick update: We have the MVP requirements draft ready and we will be reviewing it this week. Once we feel it’s ready, I’ll be sharing it here for feedback.



Hi again, sorry for the delay, we had a team off-site last week and I wasn’t able to share this with you.

After checking the feedback in this topic, together with other channels, we drafted an MVP (minimum viable product) that we want to share with you for feedback.


  • This MVP includes the things we considered most important for a first release.
  • We will be gathering feedback in this topic until September 23rd.
  • Based on feedback we will iterate the document and share with our User Experience experts for a final pass.
  • Any visuals here are just quick mockups subject to change; they do not represent the final visual direction (no UX expert was involved in them).

Common Voice Sentence Collection MVP

Project needs

  1. An input of sentences (categorized in language and source)
  2. A set of validation algorithms (ensuring length, license)
  3. An input of reviewed sentences.
  4. A way to transfer reviewed sentences to the final database.
  5. General metrics (number of sentences, validated, reviewed)

1. An input of sentences (categorized in language and source)

A web form for text input should be available. This form should:

  • Allow entering single or multiple sentences in the form.
  • Allow uploading .txt files with multiple sentences, one per line.
  • Ask for the source language (auto-detected from the browser).
  • Ask for the source of the sentences (your own, URL, other).

2. A set of validation algorithms (ensuring length, license)

Once you submit the form, a backend will process all individual sentences and apply different validation algorithms:

  • Length: Sentences should be 14 words or fewer.
  • License: Sentences must not come from copyrighted material; they must be in the public domain.

If issues are found, the results of this validation will be presented to the user, who can edit the problematic sentences or submit only the validated ones.

Once submitted, the user will be asked to keep helping and will be presented with sentences to review.
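As a rough sketch of what such a validation backend might do (the 14-word limit comes from this draft; the empty-sentence and digit checks are my own assumptions added for illustration):

```python
def validate_sentence(sentence, max_words=14):
    """Return a list of problems found; an empty list means the sentence passes.
    Only the word-count limit is from the MVP draft; the other checks are
    illustrative assumptions."""
    problems = []
    words = sentence.split()
    if not words:
        problems.append("empty sentence")
    elif len(words) > max_words:
        problems.append(f"too long: {len(words)} words (max {max_words})")
    if any(ch.isdigit() for ch in sentence):
        # Digits are ambiguous to read aloud ("1990" vs "nineteen ninety").
        problems.append("contains digits")
    return problems
```

Returning a list of problems (rather than a single pass/fail flag) would let the form highlight each issue so the user can edit the problematic sentences, as described above.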

3. An input of reviewed sentences.

Users will be presented with sentences from other users in their language (auto-detected from the browser) to validate. People should be able to:

  • Validate a sentence right away.
  • Reject a sentence right away.
  • Edit a sentence and submit it for validation.

The way information is presented should be really similar to the review system for localization tools:

A way to submit more sentences should also be presented to the user in this screen.

Any user should be able to access the review screen at any time and select a language preference to review.

5. General metrics (number of sentences, validated, reviewed)

At any given moment, the user should be able to see a quick reference of how many sentences have been validated and reviewed for the current language.

Spanish: Validated (1300) Reviewed (567)

A page with all languages metrics should also exist.

User needs

  1. Guidance page: Where to find sentences? What is a good sentence?
  2. An input form to write or an upload mechanism (txt files)
  3. A way to see post-validation output.
  4. A system to review other people’s sentences.
  5. General metrics (number of sentences, validated, reviewed)

1. Guidance page: Where to find sentences? What is a good sentence?

A link to a documentation page should be present in the tool at all times, and it should be especially visible from the submission form.

This page should contain:

  • Description of the 3 current good strategies for gathering sentences
    • How do I get public license sentences from large sources? (examples)
    • How do I get linguists involved in the project? (examples)
    • How do I submit original sentences myself?
  • Description of what constitutes a good sentence
    • Hard requirements: Length, license, grammar.
    • Nice to have: Names, cities, diverse sounds…

For 2, 3, 4, 5 see explanation in the previous section.


The question is: what will happen to our previous contributions that are awaiting validation? Will they be added to the system by you, or…?

Also, I think we may want to reconsider the sentence length limit, since depending on the language and its use of long words, the limit may need to differ from the current count.


Good point. What might be better is a phoneme or syllable count. Unless the word-count limit is set with the longest words in mind, there is a chance of unintentionally “too long” sentences.

Depending on what you mean by “awaiting validation”: if they are already in our repos, we can probably copy and paste them into the tool to get all of them into validation + community review.

The idea is that the tool will help us tackle the current backlog of sentences we have for many languages.

Quick update: We have started some work on the tool backend, and we are checking the frontend workflow with our UX experts. We are moving a bit slower than we expected, but we are making this a priority in the coming weeks.

October update

We are currently developing the sentence collection tool, and we expect to have a beta version to test by the end of the month :smiley:


November update

As you might have seen, we weren’t able to finish the tool last month. Due to some changes in resourcing, we are analyzing the best path for finishing it up.

We need to decide whether the community can help with it (if you know React and Kinto, please send me a DM) or whether we can engage an external vendor.

I’ll let you know as soon as we have all the info to decide.

Sorry for the delay :sweat:

For developers who would like to help with the sentence collector tool → Sentence collection tool development topic

5 posts were split to a new topic: Sentences that include groups of uppercase characters

2 posts were split to a new topic: Issue with sentences PR

December update

We hope to have a beta version of the sentence collection tool before the end of the year so we can test it and make sure it’s ready. Once ready we can use it to submit all backlogged sentences everyone has.

Since this topic has grown too long and there are other questions not related with the original ask, let’s organize any questions about the sentence collection tool here:

And please create new topics if there are questions about sentence validation.


A post was split to a new topic: Sentences from public-domain books