Multilanguage site localization


(Kristian) #1

My question is about the policies for language use in the localization of Common Voice, specifically whether I can incorporate some helpful phrases in another language in my translation/localization.

I am adding a very small language, Votic, with fewer than 10 speakers. I want to add it because it would help revitalization efforts in two ways: 1) it introduces a beautiful website in the language, and 2) it will encourage speakers and learners to actually speak the language.

My problem is that almost none of the computer users know the language very well. Ideally, some parts of the page's localization (the longer chunks of text introducing the project) would have explanations in the majority language they know (i.e., Russian). These explanations could be in brackets or in a paragraph next to the Votic paragraph.

I hope I made myself clear. This is not a technical question; basically, I just want the longer parts of the localization text that are hard for users to understand to appear not only in Votic but also with some Russian, to make them easier to understand.

(LRSaunders) #2

Hi Kristian - That might be possible, but I am not sure. @pmo will have a better answer on this topic and may have some follow-up questions.

(Txopi) #3

As far as I know, the goal is to collect 10,000 hours of voice data per language. If Votic has fewer than 10 speakers, unfortunately it is not possible to collect enough recordings to build a viable speech recognition system. I’m sorry.

(Peiying) #4

@kristian, as @txopi suggested, even if you involve all the native speakers out there, it would take each of them a few years of full-time recording to reach that goal. Before recording can start, a minimum of 5,000 sentences must be collected from the public domain. Technically, what you are suggesting is doable, but we need to figure out what message to put there. However, most people don’t spend much time on a page they don’t read. Promoting Votic will take a marketing effort.

Let’s explore your plan through email. I have written to you about working in Pontoon. Let me know if you have questions using Pontoon.

(Kristian) #5

@txopi, it depends on what you are aiming for with the technology. Applications with a very limited vocabulary (yes-no answers to questions, simple directions, a calculator) can work with just a few hours of training data. There is also work with good results on transfer learning, where it’s possible to use training data from more than one language.

Yes, @pmo, it will take years. From our point of view, we are happy if the users speak the language at all, and I believe the Common Voice platform will motivate them to speak. There are plenty of related languages in a similar situation, and I believe this is a good use case for Common Voice: it does common good, and it is definitely good marketing for the tool and for the other languages it targets, since all speakers of these hyper-small languages are bilingual.
Thank you, let’s explore further via email.


This response is directed to the wider Common Voice community, @txopi, and is meant to be constructive.

While 10,000 hours is the goal for languages interested in speech-to-text and other speech technologies, this is not a reason to exclude minority languages with few speakers from contributing. Common Voice is a platform for collecting speech recordings, and people contribute for different reasons.

It is true that currently most people are interested in Common Voice as a dataset for training speech recognition, but other contributors and speech communities have different agendas.

Furthermore, 10,000 hours is how much data you currently need for end-to-end systems like DeepSpeech, but you can still create useful technologies with much less data. For example, with just a few hours of data you can make a very accurate “yes” vs. “no” classifier.

We encourage all languages to contribute to Common Voice, no matter how many people speak the language.