Languages addressed

I am wondering whether there will be support for other languages such as German, French, Italian and so on. For English there are already many hours (>1000) of free speech data available, e.g. from the LibriSpeech (http://www.openslr.org/12/) project or from VoxForge (http://www.voxforge.org). For all other languages, however, the data situation is much worse, so the need to collect data for them is probably much greater. For English, the best thing to do is probably to integrate an ASR system into Mozilla and collect real user data.

Agreed with everything you said. We definitely want to open Common Voice to more languages, but right now we are building out the v1 in English. We have a goal to start with a second language before the end of the year. Stay tuned!

Hi, sorry for reopening this topic, but it is the only one I found that discusses the possibility of adding other major languages. I’m interested in helping with Italian, as I am a voice actor and dubbing director.
I was pointed here from mycroft.ai, because they are counting on integrating your project’s results into their work.
Thanks

Hello @mhenretty,

I would also be happy to contribute, but I am not sure where to begin. There are a lot of topics saying that localisation will happen soon, but without any details about which languages are concerned or when it should be available.

On the GitHub repository I found this file regarding the localisation of the website: https://github.com/mozilla/voice-web/blob/master/web/locales/en/messages.ftl. Should we begin to translate it and propose a PR?

I suppose you will also need short sentences for people to read. Can we submit our sentences in other languages at https://github.com/mozilla/voice-web/issues/341, or should we create a new issue per language?

Regards,

G

For those who want to help translate the website, you can go to https://pontoon.mozilla.org/fr/common-voice/messages.ftl/
(obviously, you will need to choose your own language before you begin translating)

You all are so fast! Yes, indeed we are in Pontoon, and some people have started translations.

But, this is still very “Beta.” We will need to make some site changes before we launch multi-language (like a way to select your language for instance). So translating now won’t be the final step. It’s just a way to get ahead of the game.

As for collecting sentences, we are starting to put together a guide for how to do this properly. Expect that in the next month or so. Here is a list of starting questions we are asking ourselves:

Hi @mhenretty,

Are those questions discussed somewhere?

Can you also explain how this is linked to the internationalisation of the website? These seem to be general questions that need to be answered for an English corpus as well, so I don’t quite understand why they need to be answered before collecting data.

I think the link is that this has been (at least partially) solved already for English, while it needs to be taken care of by the community for their own languages, and it is a very important part of the work.

@lissyx as a newbie to Mozilla community work, do you have pointers on where to go or where to ask about helping the community take care of this?

This is exactly what we are working on right now, and we were discussing it with @hellosct1 yesterday: we need a place to take care of all of that :)

Great! If you find a place, please tell me.
I look forward to it.

@Gman We have started setting things up here: https://github.com/mozfr/besogne/wiki/Common-Voice-fr

Hello @mhenretty and @lissyx,

I would like to work on setting things up for Common Voice in languages other than English.

In particular, French and German are of interest to me.

It seems that for French, you and others are off to a good start.

Is it possible to have something like:

https://voice.mozilla.org/de/
https://voice.mozilla.org/fr/
?

Each would then display the content in the corresponding language, so that the recorded data can be stored per language.
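
To illustrate what I mean, here is a rough sketch (not how voice-web actually does it, just the idea): the locale is read from the first path segment of the URL and can then be used to tag whatever gets stored. The function name `locale_from_url`, the set of supported locales and the English fallback are all assumptions made up for this example.

```python
from urllib.parse import urlparse

# Hypothetical values for illustration; not taken from the voice-web code.
SUPPORTED_LOCALES = {"en", "de", "fr"}
DEFAULT_LOCALE = "en"


def locale_from_url(url: str) -> str:
    """Return the locale named by the first path segment, or the default."""
    # Split the path into non-empty segments, e.g. "/de/" -> ["de"].
    segments = [s for s in urlparse(url).path.split("/") if s]
    if segments and segments[0].lower() in SUPPORTED_LOCALES:
        return segments[0].lower()
    # Unknown or missing prefix: fall back to the default locale.
    return DEFAULT_LOCALE


if __name__ == "__main__":
    for url in ("https://voice.mozilla.org/de/", "https://voice.mozilla.org/fr/"):
        print(url, "->", locale_from_url(url))
```

A recording submitted from a `/de/` page would then be stored together with the locale `de`, which keeps the per-language datasets separate.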

Hi @jtane,

Thanks for your interest. Indeed this is exactly what we are working on, and we already have some of this work set on our staging site:

But the biggest missing piece is collecting sentences for people to read in the new languages. I’m working on some guides for doing this now, but it will probably take a few more weeks before that’s ready. Stay tuned for that!

Great. As I see it, we need example phrases to be spoken.

Is there already a pool of examples for the different languages?

We only have English for now, but here are some topics we are considering:

So I looked at the GitHub repo for voice-web and understood how it works. I have a few questions:

  1. How does a language team provide you with data?
  2. How is this data to be shared? Which format is preferred for this?
  3. Where does the discussion on the diverse topics you mentioned take place?
  4. One thing I noticed is that the list of accents has no entry for non-native speakers. Wouldn’t it be a good idea to add a “non-native speaker” entry?
  5. I had the impression that the accent information was not provided in the feedback data. Wouldn’t it be a good idea to include it?

Which data are you referring to, the sentences to be read? If so, any format works really: an email, a pastebin, or a GitHub gist. It doesn’t matter.

See above.

I’m not sure what you mean by diverse topics, but this is a good place for discussions.

See https://github.com/mozilla/voice-web/issues/242

Not sure what you are asking here, but accent is definitely in the final download package.

Hello @lissyx,

I finally had the time to take a look. Looks like things are progressing well!
Thanks for the wiki, but I don’t have a smartphone so I can’t get on Telegram :(

Is there any other way to coordinate and help?

You should be able to join Telegram as long as you have a phone number. No smartphone is required; you can just use the web version: https://web.telegram.org/