Multilanguage site localization

kristian · January 10, 2019, 7:54am

My question is about the policies for language use in the localization of Common Voice, specifically whether I can incorporate some helpful phrases in another language in my translation/localization.

I am adding a very small language, Votic, with fewer than 10 speakers. I want to add the language because it would help revitalization attempts by 1) introducing a beautiful website in the language and 2) encouraging the speakers and learners to actually speak the language.

My problem is that almost none of the computer users know the language very well. So ideally some parts of the localization of the page (the longer chunks of text introducing the project) would have explanations in the majority language they know (i.e., Russian). These explanations could be in brackets or in a paragraph next to the Votic paragraph.

I hope I made myself clear. This is not a technical question; basically, I just want the longer parts of the localization text that are hard for users to understand to be not only in Votic but also accompanied by some Russian, to make them easier to understand.

r_LsdZVv67VKuK6fuHZ_tFpg · January 11, 2019, 11:25pm

Hi Kristian - That might be possible but I am not sure. @pmo will have a better answer on this topic and may have some follow up questions.

txopi · January 13, 2019, 4:25pm

As far as I know, the goal is to collect 10,000 hours of voice data per language. If Votic has fewer than 10 speakers, unfortunately it is not possible to collect enough recordings to build a viable speech recognition system. I’m sorry.

pmo · January 14, 2019, 5:36am

@kristian, as @txopi suggested, even if you involve all the native speakers out there, it will take each of them a few years of full-time recording to reach that goal. Before recording, a minimum of 5,000 sentences must be collected from the public domain. Technically, what you are suggesting is doable, but we need to figure out what message to put there. However, most people don’t spend too much time on a page they don’t read. It needs a marketing effort to promote Votic.

Let’s explore your plan through email. I have written to you about working in Pontoon. Let me know if you have questions about using Pontoon.

kristian · January 14, 2019, 8:02am

@txopi it depends on what you are aiming for with the technology. Applications with a very limited vocabulary (yes/no answers to questions, simple directions, a calculator) can work with just a few hours of training data. And there’s work with good results on transfer learning, where it’s possible to use training data from more than one language.

Yes, @pmo, it will take years. From our point of view, we are happy if the users speak the language at all, and I believe the Common Voice platform will motivate them to speak. There are plenty of related languages in a similar situation, and I believe this is a good use case for Common Voice: it does common good, and it is definitely good marketing for the tool and the other languages it targets, since all speakers of these hyper-small languages are bilingual.
Thank you, let’s explore further via email.

josh_meyer · February 8, 2019, 10:50pm

This response is directed to the wider Common Voice community, @txopi, and is meant to be constructive.

While 10,000 hours is the goal for languages interested in speech-to-text and other speech technologies, this is not a reason to exclude minority languages with few speakers from contributing. Common Voice is a platform for collecting speech recordings, and people contribute for different reasons.

It is true that currently most people are interested in Common Voice as a dataset for training speech recognition, but other contributors and speech communities have different agendas.

Furthermore, 10,000 hours is how much data you currently need for end-to-end systems like DeepSpeech, but you can still create useful technologies with much less data. For example, with just a few hours of data you can make a very accurate “yes” vs. “no” classifier.

We encourage all languages to contribute to Common Voice, no matter how many people speak the language.
