Are there Common Voice alternatives?

Hi,

I’m in talks with a local government institution, trying to get them involved with Common Voice. They asked whether there are any alternatives and, if so, why CV is better.

Tried searching around but couldn’t find any alternatives, let alone something better than CV.

But I’m not knowledgeable in the field, so I want to double-check. If you have answers to these questions, please do share.

(I know about Google’s Crowdsource, but 1. the data is not open source, and 2. they don’t even have tasks for my language, Georgian.)


Hey @Razmik-Badalyan, Coqui.ai keeps a rather long list in this repo. It is of course neither complete nor up to date, but it will give you an idea:

For me, the following are important wrt CV (selected ones from a rather large set):

  1. It is non-profit, open-source, and has a CC0 license
  2. I love the idea behind it. Just read the about page…
  3. CV’s goal is to collect all languages, with all their dialects, accents, etc. The main emphasis is on low-resource languages, even dying languages spoken by 1000 elderly people (there are ~7000 languages, 2500 are dying, and only a few can make it into the digital age). That part is similar to the Rosetta project. Company-based datasets focus on “customers” from whom the companies can earn money, so they only care about languages with large communities. Some of their data is also biased (e.g. toward white educated males, their userbase, so performance drops for other segments). One of CV’s main focuses is to get rid of this kind of bias by mobilizing volunteer language communities, campaigns, etc.
  4. Similarly, because those companies are for-profit, they get the right to record your data (see the EULA) and then sell your data back to you. You agree to that simply “by using their product”, not voluntarily, so you end up donating your voice so they can make more money (and a better product). Mozilla is very privacy-focused: you have to volunteer to donate your voice.
  5. For academics, innovators, hobbyists, etc., buying is usually not an option, so the only datasets they can use are open-source ones. Most of the datasets given above were produced by limited-time projects, and all cover a limited number of languages. CV is the only one (that I know of) that keeps expanding to serve humans. Our voices for ourselves.

These are solely my views; others may have different ideas…


Hi Bülent, thank you very much for your comment. It’s informative.
