Multi-language Update for Common Voice

Hello everyone,

A lot of people have been asking us through Discourse and other channels like Slack about when Common Voice will be available in their language. Well, this update is for all of you!

First, the big news: we are aiming to launch multi-language Common Voice by the week of May 7.

However, there is still a lot of work to do before we can collect voices in any new language, and we need your help!

The first part we need help with is translating the content of the website. Some of you may have noticed that this work has already begun in Pontoon, Mozilla’s tool used to help translate Firefox and Mozilla.org. If you would like to join the translation effort for Common Voice, or add a new language, please reach out to me at mikey(at)mozilla.com.

Second, we will need sentences in these new languages for people to read into Common Voice. This part is a little trickier, as there are many considerations when assembling a complete set of sentences for people to read. That is why we are working with a group of language and speech experts (which we’re calling the Common Voice Speech Advisory Group) to create a guide for collecting and/or writing these sentences. We expect to have that guide published in a public location by the end of April.

In the meantime, you can help us look for sentences to read in your language by searching for public domain texts. Possible sources include copyright-free material such as government proceedings, movie or drama scripts, and perhaps radio or podcast transcripts. We hope to have more information soon about how to look for this public material, but in the meantime feel free to use this thread to ask questions and discuss places to look.

Lastly, we would like to thank everyone for their interest in and help with the Common Voice project so far. Without you, this project couldn’t exist. We are already seeing the English Common Voice data being used in speech engines and university research. We hope that by going multilingual we will empower whole new communities to take part in voice technology. Help us make that happen!

With more to come soon,
Michael & the Common Voice team


A couple of quick questions: can Wikipedia content be used? I ask because for many languages, unlike English and other widely spoken languages, finding public domain content is almost impossible for various reasons. My own language, Odia, and many languages from India do not have much public domain content in typed text form, because of the lack of large-scale digitization efforts and because the default license for publicly funded projects is full copyright. It pains me to recall that our government proceedings are copyrighted like any other content; the government websites clearly say “All Rights Reserved”. Wikipedia content comes as something of a savior when it comes to openly licensed content. Considering the complexity of these issues for the many languages that lack public domain content, could you folks reconsider allowing CC-BY-SA content as well, if not for all languages, then at least for some?

Look forward to checking the guides that you’re planning to release.


Hi Subha,

Thanks for asking this question. I’m very sad to say that under the current license structure we cannot use Wikipedia content. See this comment on GitHub (from a Wikimedia lawyer) for context:

To answer your question though, we may consider supporting more license types in the future. But this is a non-trivial amount of work on our part (from a legal, engineering, and user experience perspective). We will first try to collect public domain material, and if this turns out to be a big blocker we will consider doing this re-licensing work.


Can you please list which Indian languages will be available from May 7, so that we can be prepared?

Thanks
Ranjith Raj

Are manually sourced sentences still an option, as they were for English? I have already gathered quite a few Dutch sentences (all written by myself) in GitHub issues (https://github.com/mozilla/voice-web/issues/213) and have some more stored locally.

Where can we best submit this material? Are we supposed to make one GitHub issue or Discourse thread per language?

I guess there are also some old books in Dutch that are currently in the public domain. However, Dutch is a living language, meaning that some words are no longer used or their meanings have changed. Is it still recommended to use these sources, or should we ignore them?


Hi @jef.daniels, that’s awesome! Those sentences are very much an option, as long as they are public domain (CC-0). I’ll start working on multi-language contribution this week, and it looks like it will start the same way we’re doing it for English: sentences stored in (possibly multiple) .txt files, one sentence per line.
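As an illustration of the format described above (plain text files, one sentence per line), here is a minimal Python sketch for loading such a file. The file name and helper name are hypothetical examples, not project conventions:

```python
# Minimal sketch: read a sentence file in the format described above
# (plain UTF-8 text, one sentence per line, blank lines ignored).
from pathlib import Path

def load_sentences(path):
    """Return a list of non-empty sentences, one per line of the file."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]

if __name__ == "__main__":
    import os
    import tempfile

    # Write a tiny example file (Dutch sample sentences, invented here).
    with tempfile.NamedTemporaryFile(
        "w", suffix=".txt", delete=False, encoding="utf-8"
    ) as f:
        f.write("De kat zit op de mat.\n\nHet regent vandaag.\n")
        tmp = f.name

    print(load_sentences(tmp))  # ['De kat zit op de mat.', 'Het regent vandaag.']
    os.remove(tmp)
```

Splitting on line breaks and dropping blank lines keeps the loader tolerant of the extra spacing that tends to creep into hand-curated text files.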

We’ll have more information on how to exactly add that to the project soon.

Thanks!

There was a similar discussion around old English works; the position there was that (as you said) such writing doesn’t sufficiently reflect the way the language is currently spoken, and it might also be problematic in other ways (maybe old Dutch writing is more progressive in that regard :grin:). Hence, for the current purposes of this project, we didn’t include such samples.

For German, the protocols of the parliament (Bundestag) might be a good fit. They are provided at https://offenesparlament.de/ under a CC-0 license.

If I scroll to the bottom of that page, it tells me that the content is under CC 3.0 Attribution and not CC-0 :face_with_raised_eyebrow:
Am I missing something? :grin:

I think this refers to the website itself, not the raw parliamentary protocol data. The relevant text on their data page, https://offenesparlament.de/daten/, is “Alle Daten sind unter CC0 als Open Data frei zugänglich und können hier heruntergeladen werden” (“All data is freely accessible as Open Data under CC0 and can be downloaded here”).


Wikipedia might not be a good choice, but we can definitely use many public domain books and other content from Wikisource for local languages.

Example:
A CC0-licensed book in Telugu:
https://te.wikisource.org/wiki/à°•à±à°Ÿà±à°‚à°Ź_à°šà°żà°Żà°‚à°€à±à°°à°Ł_à°Șఊ్ధఀుà°Čు


Yes, maybe we need a script to scrape the various language Wikisources. The only issue I found is that the language is generally outdated (the books are from the 1800s).

True to some extent; a lot of those public domain works date back 70 or 80 years. We simply cannot take every book. Someone who knows the language needs to check manually and list the Wikisource books that can be used in this project.


It would be great if there were an option to translate existing sentences into other languages, and to create a relation between those sentences.
Those sentences could be used both for voice recognition and for translation into other languages.

Wikisource provides metadata, including the date. I’m planning on making an importing tool. I already have one for Project Gutenberg (tailored for French, but patches are welcome): https://github.com/lissyx/commonvoice-syceron/blob/master/project-gutenberg.py
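Since Wikisource exposes a date in its metadata, an importer could use it to address the “outdated language” concern raised earlier in the thread. The sketch below is purely illustrative: the record fields and cutoff year are invented, and a real importer would read metadata from the Wikisource or Project Gutenberg APIs rather than hard-coded dictionaries.

```python
# Hypothetical sketch: keep only candidate texts published recently enough
# that their language still reflects current usage. Records lacking a
# 'year' field are excluded, since their age cannot be verified.

def filter_by_year(records, min_year=1900):
    """Return records whose 'year' field is at or after min_year."""
    return [r for r in records if r.get("year", 0) >= min_year]

# Invented example records, standing in for scraped Wikisource metadata.
candidates = [
    {"title": "Oude Grammatica", "year": 1850},
    {"title": "Moderne Essays", "year": 1935},
]
recent = filter_by_year(candidates)
print([r["title"] for r in recent])  # ['Moderne Essays']
```

A date filter like this would only narrow the candidate list; as noted above, a speaker of the language still needs to review each book manually before it is used.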

Am I missing something, or is the multi-language update still not out?
I would like to train with a German dataset, but I can only find the English one.

@nils.kuc, you need to change the language to German. I think it is tracked in one of the issues on GitHub, but for now that is the only way.

Yes, we haven’t released any data for languages other than English. See this topic: