Readme: How to see my language on Common Voice


(Rubén Martín) #1

Hello everyone,

I would like to open this topic to answer one of the questions we get most often: How do I get my language on Common Voice?

There are 3 steps to have your language ready:

  1. Have the website localized through Pontoon.
  2. Gather sentences that are in the public domain.
  3. Validate and review them so they can be incorporated into the database.

Once we have enough validated and reviewed sentences (usually over 5000), we can enable voice recording for that language on the site.

Important: Step 3 is currently blocked because we are still finishing the sentence collection tool that will support this review system. However, you can keep gathering public domain sentences and submit them for validation and review once the tool is ready.

Please add any questions to this topic and we will be happy to support you :slight_smile:


(Txopi) #3

Could you add Basque to the in-progress group? We are translating the website (78% done so far) and locating sources we can use to build the sentence collection: https://librezale.eus/wiki/EdukiakJabetzaPublikoan

We don’t have to wait until we have collected 5,000 sentences, do we? How should we send them to you: by linking chunks in Discourse, or by making pull requests on GitHub?

Thank you!


(Rubén Martín) #4

Please keep translating the website and collecting public domain sentences. As soon as the new tool is ready, you will be able to submit the sentences for peer review.

If you are asking about the number of sentences needed to enable voice collection on the site, then yes, we usually need at least 5000 reviewed sentences.


(Radoslav Kolev) #5

Hello!

We are working on adding Bulgarian to Common Voice and are almost finished with translating the web interface.

Regarding sentence collection, can you give some more details about the format and requirements, so that things go as smoothly as possible later on?

Is a simple text file with one sentence per line OK? I have seen somewhere that there are limits on the minimum/maximum number of words in a sentence. It would be good to state these officially here.

Also, once the web translation is ready, do we need to notify someone for it to appear on the website, or will it be activated automatically?

Thanks and best regards,
Radoslav


(Gregor) #6

Hi Radoslav,

thanks for reaching out, looking forward to having Bulgarian on the site (my sister’s husband is Bulgarian, so I have some skin in the game :grin: ).

Here’s the English sentence set as an example: https://github.com/mozilla/voice-web/tree/master/server/data/en
Those are indeed all text files, with one sentence per line.

Once the translation is done, you’ll need to ping me and then we’ll activate it.

Best,
Gregor


(Luc Salommez) #7

Hello Radoslav,

As for the number of words per sentence, a maximum of around 15 words works well.
The idea is that a sentence should take between 3 and 5 seconds to read aloud.

Best regards,
Luc.
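
For anyone preparing a sentence file, the guidance above (one sentence per line, roughly 15 words at most) could be checked with a small script before submitting. This is only a rough sketch and not an official Common Voice tool: the names `MAX_WORDS`, `check_sentences` and `check_file` are made up for illustration, the limit of 15 is taken from the "around 15 words" suggestion rather than from a documented rule, and counting words by splitting on whitespace is an assumption that will not suit languages written without spaces.

```python
from typing import List, Tuple

# Assumed soft limit, based on the "around 15 words max" guidance in this thread.
MAX_WORDS = 15


def check_sentences(lines: List[str], max_words: int = MAX_WORDS) -> Tuple[List[str], List[str]]:
    """Split lines into (accepted, rejected) by whitespace-separated word count."""
    accepted: List[str] = []
    rejected: List[str] = []
    for raw in lines:
        sentence = raw.strip()
        if not sentence:
            continue  # ignore blank lines
        if len(sentence.split()) <= max_words:
            accepted.append(sentence)
        else:
            rejected.append(sentence)
    return accepted, rejected


def check_file(path: str, max_words: int = MAX_WORDS) -> Tuple[List[str], List[str]]:
    """Read a UTF-8 text file with one sentence per line and check each line."""
    with open(path, encoding="utf-8") as handle:
        return check_sentences(handle.read().splitlines(), max_words)
```

Calling `check_file("sentences.txt")` on a hypothetical file returns the sentences that fit the limit and the ones that should probably be shortened or split. Comparing `len(accepted)` against 5000 gives a rough idea of how close a language is to the usual threshold, although only reviewed sentences count towards it.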


(Radoslav Kolev) #8

Since the Bulgarian website is already online and we’ll be starting work on the sentence collection, one more question: What licenses are allowed? I’ve read in some other places about CC0 and Public Domain. I guess that includes sentences authored by a submitter who is willing to license them under one of those two options. Is attribution, author or source information required?


(Rubén Martín) #9

Yes, the license needs to be Public Domain (CC0), and we ask for the source so we can check it.

Thanks!