Languages addressed

mhenretty · February 22, 2018, 7:21pm

You all are so fast! Yes, indeed we are in Pontoon, and some people have started translations.

But, this is still very “Beta.” We will need to make some site changes before we launch multi-language (like a way to select your language for instance). So translating now won’t be the final step. It’s just a way to get ahead of the game.

As for collecting sentences, we are starting to put together a guide for how to do this properly. Expect that in the next month or so. Here is a list of starting questions we are asking ourselves:

github.com

NLPH/NLPH/blob/master/common_voice/data_collection_methodology.md

# Methodology for data collection for voice corpora

Starting with rough points to address:

- Sentences should come from an open data source, preferably public domain.
  - What type of source should be preferred? Written? Transcripts of spoken? News, articles, books, social media?
  - What type of register should be preferred?
  - How modern should be the source? Is there any use for older data sources?

- The data set should be constructed to provide good coverage of different language components and aspects:
  - Words
    - Inflections: How good a coverage can and should we aim for? Which types of inflections (gender, tense, number, person, mood, etc.)?
  - Phonemes? 
  - Accents? Gender or age of speakers? Geographical origin? What is even feasible here?
  - Should any properties of the word and/or phoneme frequency distribution in the language be kept? If so, measured on what corpora (should the distribution be estimated on written corpora)?

- Should data on contributors be collected and kept, if possible? If so, which? Age, gender, etc.

- A train/validation/test split should be determined beforehand.
  - What is a proper ratio?

This file has been truncated.
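To make the "split determined beforehand" point concrete, here is a minimal sketch of one way to fix a split in advance: hash each speaker ID into a stable bucket, so a given speaker always lands in the same set no matter when their clips arrive. The 80/10/10 ratio, the `assign_split` name, and the choice to split by speaker are all assumptions for illustration; the methodology file leaves the proper ratio as an open question.

```python
import hashlib

# Assumed ratios, for illustration only; the proper ratio is an open question.
SPLITS = (("train", 0.8), ("validation", 0.1), ("test", 0.1))


def assign_split(speaker_id: str, splits=SPLITS) -> str:
    """Deterministically map a (hypothetical) speaker ID to a split name."""
    digest = hashlib.sha256(speaker_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and scale it to [0, 1).
    position = int.from_bytes(digest[:8], "big") / 2**64
    cumulative = 0.0
    for name, share in splits:
        cumulative += share
        if position < cumulative:
            return name
    # Guard against floating-point rounding in the cumulative sum.
    return splits[-1][0]


if __name__ == "__main__":
    for speaker in ("speaker-001", "speaker-002", "speaker-003"):
        print(speaker, assign_split(speaker))
```

Because the assignment depends only on the ID, no list of assignments has to be stored, and new contributors never cause existing ones to move between sets.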

Gman · March 5, 2018, 10:07pm

Hi @mhenretty,

Are those questions discussed somewhere?

Can you also explain how this is linked to the internationalisation of the website? These seem to be general questions that need to be answered for an English corpus as well, so I don’t exactly understand why they need to be answered prior to collecting data.

lissyx · March 7, 2018, 1:53pm

I think the link is that this has been (at least partially) solved already for English, while it needs to be taken care of by the community for their own languages, and it is a very important part of the work.

jtane · March 8, 2018, 8:13am

@lissyx as a newbie to the Mozilla community work… do you have pointers on where to go or where to ask about helping the community take care of this…?

lissyx · March 8, 2018, 8:39am

This is exactly what we are working on right now, and we were discussing with @hellosct1 yesterday: we need a place to take care of all of that

jtane · March 8, 2018, 9:43am

Great! If you find a place… please tell me.
I look forward to it.

lissyx · March 13, 2018, 6:13pm

@Gman We have been starting to set things up here: https://github.com/mozfr/besogne/wiki/Common-Voice-fr

jtane · March 25, 2018, 11:10pm

Hello @mhenretty and @lissyx,

I would like to work on setting things up for Common Voice for languages other than English.

In particular, French and German are of interest to me.

It seems that for French… you and others are off to a good start.

Is it possible to have something like:

https://voice.mozilla.org/de/
https://voice.mozilla.org/fr/
?

Each of these would then display the content in the corresponding language, so that the data can be stored under that language.
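A rough sketch of what I have in mind: take the first path segment of the URL as the language and fall back to English otherwise. This is only an illustration, not how voice-web actually does it; `SUPPORTED_LOCALES`, `DEFAULT_LOCALE`, and `locale_from_url` are hypothetical names.

```python
from urllib.parse import urlsplit

# Hypothetical set of enabled languages; not taken from the voice-web code.
SUPPORTED_LOCALES = {"en", "fr", "de"}
DEFAULT_LOCALE = "en"


def locale_from_url(url: str) -> str:
    """Return the locale named by the first path segment, or the default."""
    path = urlsplit(url).path
    segments = [segment for segment in path.split("/") if segment]
    if segments and segments[0].lower() in SUPPORTED_LOCALES:
        return segments[0].lower()
    return DEFAULT_LOCALE


if __name__ == "__main__":
    for url in ("https://voice.mozilla.org/de/", "https://voice.mozilla.org/"):
        print(url, "->", locale_from_url(url))
```

Recordings could then be filed under whatever locale this returns.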

mhenretty · March 26, 2018, 2:56pm

Hi @jtane,

Thanks for your interest. Indeed this is exactly what we are working on, and we already have some of this work set on our staging site:

But the biggest missing piece is collecting sentences for people to read in the new languages. I’m working on some guides for doing this now, but it will probably take a few more weeks before that’s ready. Stay tuned for that!

jtane · March 26, 2018, 3:41pm

Great. As I see it, we need examples of phrases to be spoken.

Is there already a pool of examples for the different languages?

mhenretty · March 26, 2018, 4:02pm

We only have English for now, but here are some topics we are considering:

github.com

NLPH/NLPH/blob/master/common_voice/data_collection_methodology.md

jtane · March 26, 2018, 8:00pm

So I looked at the GitHub repo for voice-web and understood how it works.

1. How does a language team provide you with data?
2. How is this data to be shared? Which data is preferred for this?
3. Where does the discussion on the diverse topics you mentioned occur?
4. One thing I noticed is that the list of accents has no entry for non-native speakers… wouldn’t it be a good idea to add a “non-native speaker” entry?
5. I had the impression that the accent information was not provided in the feedback data. Wouldn’t it be a good idea?

mhenretty · March 27, 2018, 5:44am

1. Which data are you referring to, sentences to be read? If so, any format works really: an email, a pastebin, or a GitHub gist. It doesn’t matter.
2. See above.
3. I’m not sure what you mean by diverse topics, but this is a good place for discussions.
4. See https://github.com/mozilla/voice-web/issues/242
5. Not sure what you are asking here, but accent is definitely in the final download package.

Gman · May 14, 2018, 9:30pm

Hello @lissyx,

I finally had the time to take a look. Looks like things are progressing well!
Thanks for the wiki… but I don’t have a smartphone, so I can’t get to Telegram.

Is there any other way to coordinate and help?

lissyx · May 15, 2018, 6:31am

You should be able to join Telegram as long as you have a phone number. No smartphone is required, just the web version: https://web.telegram.org/

Gman · October 24, 2019, 2:14pm

That’s what I thought as well, but it’s not possible: you need an Android phone or an iPhone.

lissyx · May 15, 2018, 9:43am

That’s strange or new, because I could do that from B2G in the past, using the web client :-(.

lissyx · May 15, 2018, 9:46am

I have just been able to join Telegram from the web interface. I received an SMS code to confirm and it was okay.

Gman · May 15, 2018, 10:50am

Probably because you already have an account

lissyx · May 15, 2018, 11:13am

Nope, that was with a new phone number :). Sorry, I cannot help more :/.