📖 Readme: How to see my language on Common Voice

:triangular_flag_on_post: This information is now also available on the About pages of the Common Voice website. Please help us localise it by joining Pontoon.


:open_book: Mozilla Voice Community Playbook: The source of truth for setting up and maintaining self-sustainable communities.


Hello everyone,

I would like to open this topic to summarize one of the most frequently asked questions we get: how do I get my language into Common Voice?

There are three steps to have your language ready:

:globe_with_meridians: Have the website localized through Pontoon

If your language is not there yet, please create a new topic in this category with your request, indicating the language and the script.

:hammer: Skills needed: English knowledge, strong knowledge of your language.

Reference: Common Voice languages and accent strategy v5

:open_book: Gather a large number of sentences in the public domain (CC-0)

:hammer: Skills needed: Command line usage and git, familiar with regular expressions.

:white_check_mark: Submit and review more sentences from other sources (not wikipedia)

To be incorporated into the database using the Sentence Collector tool.

:hammer: Skills needed: Strong grammar knowledge of the target language you are contributing to.

If you have found an existing public domain corpus larger than 100K sentences, we have a separate process to handle it, since we understand that manual validation using the Sentence Collector is not ideal at that scale.

Please create a new topic here so we can evaluate if your corpus fits the license and size requirements to run this process.

:hammer: Skills needed: Expertise processing and cleaning up text, linguistics/language expertise to check the quality of the resulting sentences.

:next_track_button: Next step

Once you have enough validated and reviewed sentences (usually over 5000), we can enable the language to accept voice recordings on the site. At that point you might wonder: "My language is now collecting voice, what do I need to know?"

:warning: Please note that you will have to keep adding sentences so that more recordings can be collected without repetitions.

Feel free to add any questions to this topic and we will be happy to support you :slight_smile:


I wonder if you could add Basque to the in-progress group. We are translating the website (78% done so far) and locating the sources we can use to prepare the whole data set: https://librezale.eus/wiki/EdukiakJabetzaPublikoan

We don’t have to wait until we have collected 5,000 sentences, do we? How should we send them to you: by linking chunks in Discourse, or by making pull requests on GitHub?

Thank you!

Please keep translating the website and collecting public domain sentences. As soon as we have the new tool ready, you will be able to submit the sentences for peer review.

If you are asking about the number of sentences to enable the voice collection on the site, yes, usually we need at least 5000 reviewed sentences.

Hello!

We are working on adding Bulgarian to Common Voice and are almost finished with translating the web interface.

Regarding sentence collection, can you give some more details about the format and requirements that will make things easiest afterwards?

Is a simple text file with one sentence per line OK? I saw somewhere that there are limits on the minimum/maximum number of words in a sentence. It would be good to state these officially here.

Also, once we are done with the web translation, do we need to notify someone for it to appear on the website, or will it be activated automatically?

Thanks and best regards,
Radoslav

Hi Radoslav,

thanks for reaching out, looking forward to having Bulgarian on the site (my sister’s husband is Bulgarian, so I have some skin in the game :grin: ).

Here’s the English sentence set as an example: https://github.com/mozilla/voice-web/tree/master/server/data/en
Indeed those are all text files, with one sentence per line.

Once the translation is done, you’d need to ping me and then we’ll activate it.

Best,
Gregor
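
For anyone preparing such a file, here is a minimal sketch of reading and writing the one-sentence-per-line format described above. The function names `parse_sentences` and `format_sentences` are made up for illustration and are not part of any Common Voice tooling; skipping blank lines and trimming whitespace are assumptions about what a clean file should look like.

```python
def parse_sentences(text: str) -> list[str]:
    """Split file contents into sentences, one per line, skipping blank lines."""
    sentences = []
    for line in text.splitlines():
        sentence = line.strip()  # drop leading/trailing whitespace
        if sentence:  # ignore empty lines
            sentences.append(sentence)
    return sentences


def format_sentences(sentences: list[str]) -> str:
    """Serialize sentences as one-per-line text with a trailing newline."""
    return "\n".join(sentences) + "\n"
```

To use it on a real file, you could pass the result of `open(path, encoding="utf-8").read()` to `parse_sentences`, where `path` points at your own text file.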

Hello Radoslav,

About the number of words per sentence, around 15 words max should be great.
The idea is that a sentence should take between 3 and 5 seconds to read.

Best regards,
Luc.
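
As a rough way to apply the 15-word guideline before submitting, the sketch below separates sentences that fit from those that are too long. The names `word_count` and `split_by_length` are hypothetical, and the limit is only the guideline mentioned in this thread, not a rule enforced by the site. Counting words by splitting on whitespace is an assumption that does not hold for languages written without spaces between words.

```python
MAX_WORDS = 15  # guideline from this thread, not an enforced limit


def word_count(sentence: str) -> int:
    """Count whitespace-separated words in a sentence."""
    return len(sentence.split())


def split_by_length(sentences, max_words=MAX_WORDS):
    """Return (kept, too_long): sentences within the limit and those over it."""
    kept, too_long = [], []
    for sentence in sentences:
        if word_count(sentence) <= max_words:
            kept.append(sentence)
        else:
            too_long.append(sentence)
    return kept, too_long
```

Keeping the over-long sentences in a separate list, rather than discarding them, makes it easy to shorten or split them by hand later.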

Since the Bulgarian website is already online and we’ll start work on the sentence collection, one more question: which licenses are allowed? I’ve read in some other places about CC0 and Public Domain. I guess that includes sentences authored by a submitter willing to license them under one of those two options. Is attribution/author or source information required?

Yes, the license needs to be Public Domain (CC-0), and we ask for the source so we can check it.

Thanks!

Some other suggestions, based on my work adding the Italian language with the Mozilla Italia community:

  • At least 7000 sentences
  • No bad words, because the website is open to everyone and there are no regulations
  • Check that each sentence can be read within 20 seconds, which is the maximum recording length
  • Validate the sentences against your language's grammar rules
  • Take them from different contexts: books, school material (for example, people donate their theses), religion, IT, novels, interviews, etc.
  • Set up a review workflow in which no single person blocks the others

PS: avoid Wikipedia, to prevent licensing issues and to stay free to modify the sentences as you need.
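
Part of the checklist above can be automated before human review. The following is only a sketch under stated assumptions: `clean_sentences` is a hypothetical helper, the blocklist is something each community would have to compile for its own language, and the grammar and context checks still need people.

```python
import re


def clean_sentences(sentences, blocklist):
    """Drop blank lines, exact duplicates, and sentences containing blocked words."""
    blocked = {word.lower() for word in blocklist}
    seen = set()
    result = []
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence or sentence in seen:
            continue  # skip empty lines and exact duplicates
        # Compare whole words, case-insensitively, against the blocklist.
        words = {word.lower() for word in re.findall(r"\w+", sentence)}
        if words & blocked:
            continue
        seen.add(sentence)
        result.append(sentence)
    return result
```

For example, `clean_sentences(lines, ["darn"])` would keep the first copy of each sentence and drop any sentence containing the word "darn". Matching whole words avoids rejecting harmless sentences that merely contain a blocked word as a substring.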

Here’s a blog post on how we added a new language to Common Voice, from start to finish.

http://jrmeyer.github.io/misc/2019/05/29/mozilla-kyrgyz-common-voice.html


Hi,

Is step 3 still blocked? I have plenty of Spanish sentences and I would like to upload them.

The sources are public domain books of Spanish-language literature, from which I selected mainly dialogue phrases.

On the other hand, are there other requirements regarding the length of the sentences, grouping them into categories, the types of words in the phrases, named entities, numbers, dates, the use of slang, or anything else?

Best regards,
Mar

Ideally, we will have a beta version of the sentence collection tool to test before the end of the year. If the QA of that beta is satisfactory, we can start using it to collect and review sentences.

I understand it’s frustrating, but please keep collecting sentences so we can submit them through the tool as soon as it’s ready.

Yes, we are working with the Deep Speech team to produce a document with all the requirements, and we want the sentence collection tool to enforce some of them so we don’t have to check them manually.


Thanks a lot Rubén,

I have some other questions:

  • How did the other languages manage? Is there any workaround?
  • Why are this procedure and software now mandatory for the Spanish language?

Please understand that it is quite strange that the third (maybe second or fourth) most spoken language in the world is not included yet…

Best Regards,
Mar

Most languages that reached enough sentences to be in the voice phase (more than 5000) were the ones that provided that many strings during the campaign we ran a few months ago. Unfortunately, Spanish did not gather many (just a few hundred).

Other languages were using GitHub pull requests, but we have found that some of them will need a cleanup in order to be useful for the machine learning engine. That’s why, from now on, we want to make sure the sentences we include are properly reviewed, to avoid having to do another cleanup in the future.

The good news is that this should not stop any efforts to collect sentences. In early January, as soon as we have the tool ready, we will be able to mass import them and use the tool for a proper review and approval.


Sorry for nitpicking, but CC0 is not Public Domain, according to the Creative Commons wiki:
https://wiki.creativecommons.org/wiki/CC0_FAQ
I think it would be more accurate to say that the project accepts sentences from CC0 and Public Domain sources.



Hi @jumasheff, I consulted with our legal team to make sure I was giving you the best answer, and they let me know that CC0 is a dedication to the public domain (to the maximum extent possible). In fact, the license is called “Public Domain Dedication.” Let me know if you have any specific questions about this or would like further clarification.


I want to note that we have just launched the sentence collection tool.