Mozilla Voice Community Playbook: The source of truth for setting up and maintaining self-sustaining communities.
I would like to open this topic to answer one of the questions we are asked most often: How do I get my language into Common Voice?
There are three steps to have your language ready:
Have the website localized through Pontoon.
Skills needed: English knowledge, strong knowledge of your language.
Gather a large number of public domain (CC0) sentences.
- We recommend using our sentence extractor on your language's Wikipedia as the first source.
- You can also use the EuroParliament corpus.
Skills needed: Command line usage and git, familiarity with regular expressions.
Submit and review more sentences from other sources (not Wikipedia).
These are incorporated into the database using the Sentence Collector tool.
Skills needed: Strong grammar knowledge of the target language you are contributing to.
If you have found an existing public domain corpus of more than 100K sentences, we have a separate process to handle it, since we understand that validating that many sentences manually in the Sentence Collector is not practical.
Please create a new topic here so we can evaluate if your corpus fits the license and size requirements to run this process.
Skills needed: Expertise in processing and cleaning up text, linguistics/language expertise to check the quality of the resulting sentences.
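To give an idea of what "cleaning up text" can involve before a corpus is proposed, here is a minimal Python sketch. It is not the sentence extractor or any official Common Voice tooling: the function name `clean_sentences`, the word-count limits, and the rule that drops sentences containing digits are all assumptions chosen for illustration, and the real rules should be agreed per language.

```python
import re

# Hypothetical thresholds for illustration only; not official Common Voice rules.
MIN_WORDS = 3
MAX_WORDS = 14

DIGITS = re.compile(r"\d")
WHITESPACE = re.compile(r"\s+")


def clean_sentences(lines):
    """Return de-duplicated sentences that pass a few simple checks."""
    seen = set()
    kept = []
    for line in lines:
        # Collapse runs of whitespace and trim the ends.
        sentence = WHITESPACE.sub(" ", line).strip()
        if not sentence:
            continue
        # Drop sentences that are too short or too long to read aloud comfortably.
        word_count = len(sentence.split(" "))
        if word_count < MIN_WORDS or word_count > MAX_WORDS:
            continue
        # Assumed rule: skip sentences with digits, since numbers can be read in several ways.
        if DIGITS.search(sentence):
            continue
        # Case-insensitive duplicate check.
        key = sentence.casefold()
        if key in seen:
            continue
        seen.add(key)
        kept.append(sentence)
    return kept


if __name__ == "__main__":
    sample = [
        "The  cat sat on the mat.",
        "the cat sat on the mat.",
        "In 1999 it rained a lot.",
        "Too short.",
    ]
    for sentence in clean_sentences(sample):
        print(sentence)
```

A script like this only removes the obvious problems; someone with strong knowledge of the language still needs to review a sample of the output for quality.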
Once you have enough validated and reviewed sentences (usually over 5,000), we can enable the language to accept voice recordings on the site. At that point you might wonder: "My language is now collecting voice, what do I need to know?"
Please note that you will have to keep adding sentences so that new recordings can be allocated without repeating sentences.
Feel free to add any questions to this topic and we will be happy to support you.