Enable Sinhala for collecting and reviewing a dataset for Mozilla Common Voice

Hi,

I would like to initiate the Common Voice project in Sri Lanka.
Initially we can start with the Sinhala (si) language. We need to discuss the Tamil language (ta_LK) separately to avoid confusion with ta_IN.

What kind of information do you need from our side?
What is the minimum amount of Sinhala text required to extract sentences?
What are the recommended recording environments, and how many samples are required per sentence?
Is there a recommended guideline?

Regards,
Danishka


Hi @danishka, thank you for your interest.

We plan to start localizing the site into multiple languages very soon, and part of that will be figuring out the answers to your questions here. As a first step, collecting a large set of public domain Sinhala sentences would help us kick-start the process. I’ll try to answer the rest of your questions below.

What kind of information do you need from our side?

Nothing just yet :slight_smile:

What is the minimum amount of Sinhala text required to extract sentences?

I think we need at least a couple thousand sentences to get started. Ideally we would have several hundred thousand, or even a million or more.

What are the recommended recording environments, and how many samples are required per sentence?

Recording at home or on the go, through the website and app, should cover all the environments we need. We want a large variety: noisy/quiet, computer/phone, home/out. All are good!

Is there a recommended guideline?

Not yet, but part of the process for localizing the site will be coming up with this guideline. Stay tuned!

This could be done. I can help too. I am a native Sinhala speaker.

As a reminder this topic has all the information about how to get a language included in Common Voice:

Cheers.
