New Language Support in Common Voice

Shadman_Taqi · May 20, 2020, 3:26pm

Can you add Bengali to the DeepSpeech languages? I think it is important!
With 228 million native speakers and 37 million second-language speakers, Bengali is the fifth most-spoken native language and the seventh most-spoken language by total number of speakers in the world. Ref: Bengali language

As it is my native language, I will be able to support you. Let me know how I can help in the process.

nukeador · April 30, 2020, 11:10am

Bengali is already in progress for localization

https://pontoon.mozilla.org/bn/common-voice/

I see there are only 598 sentences from our sentence collector and no existing effort to get Wikipedia sentences (this should come first).

Please see this topic for reference:

mm.crjx · May 28, 2020, 1:51pm

@nukeador I have started extracting sentences from the wiki dump. The repo is here. I need some help figuring out the language rules and the blacklist. I think the issue is that common-voice-wiki-scraper treats “.” as a sentence terminator, which is not the case in Bengali. Even the Bengali Common Voice website has issues. How do I reach the translators? Should I open issues in voice-web? Maybe this is asking too much, but a Riot channel for the contributors (like the other Common Voice communities have) would help this effort a lot. How do I go about that?

nukeador · May 28, 2020, 2:18pm

You can ask about the rules in the sentence extractor topic or in the Matrix room about the tool.

You can reach out to the Bengali localizers and reviewers from

https://pontoon.mozilla.org/bn/common-voice/contributors/

mm.crjx · June 30, 2020, 8:43pm

I have made some progress on scraping bn.wiki: a more or less complete rules file (bn.toml) has been created with help from @mkohler. However, it is only generating 410-20 sentences from 90k articles on Bengali Wikipedia, which seems low compared to the 4.5k sentences out of 139k articles on Hindi Wikipedia. Maybe it's an issue with the sentence tokeniser.

Nevertheless, there is another issue with blocklist creation, because cvtools is hardcoded for the Roman alphabet. A word statistics / blacklist generation tool like this will be particularly useful for scraping more public domain sources. Can anyone suggest an alternative tool? I am sure there are plenty. It wouldn’t be productive to modify cvtools if a good alternative already exists.

All the files related to the progress can be found in the repo mentioned earlier.

nukeador · July 1, 2020, 11:06am

@dabinat might have some thoughts here

Arijit_Mukherjee · July 2, 2020, 1:45pm

https://github.com/Common-Voice/cv-sentence-extractor/blob/master/src/extractor.rs#L127

It basically falls back to the English tokenizer, which is what causes the issue: it only picks up sentences ending with “.”.
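For illustration only, here is a minimal Python sketch of what a Bengali-aware split could look like. It assumes sentences end with the danda “।” (U+0964), “?” or “!”; the function name is hypothetical, and this is not the extractor's code, which is written in Rust.

```python
import re

# Assumption: a Bengali sentence ends with the danda "।" (U+0964), "?" or "!".
# The Latin full stop "." is deliberately NOT treated as a terminator.
SENTENCE_PATTERN = re.compile(r"[^।?!]+[।?!]*")


def split_bengali_sentences(text: str) -> list[str]:
    """Split text into sentences, keeping each terminator attached."""
    fragments = (match.strip() for match in SENTENCE_PATTERN.findall(text))
    # Drop fragments that were only whitespace between sentences.
    return [fragment for fragment in fragments if fragment]


if __name__ == "__main__":
    for sentence in split_bengali_sentences("আমি ভাত খাই। তুমি কি আসবে? হ্যাঁ!"):
        print(sentence)
```

Trailing text without a terminator is returned as a final fragment, so real input would still need the usual rule-file filtering afterwards.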

dabinat · July 4, 2020, 3:04am

@nukeador I will take a look at this over the weekend.

dabinat · July 6, 2020, 5:18am

@mm.crjx Can you try the new “more-characters” branch of cvtools and see if it solves the problem for you?

mm.crjx · July 23, 2020, 8:03am

@dabinat Sorry for the delay, I was busy with an exam. It does work, although the output has numbers in it. I have to find a way to eliminate those.

dabinat · July 23, 2020, 6:21pm

@mm.crjx OK, I will look into this. The best solution may be to allow users to specify an alphabet, as DeepSpeech does.
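A rough Python sketch of the alphabet idea, not cvtools code: it assumes the allowed alphabet is the Bengali Unicode block minus the Bengali digits, and that the blocklist should hold rarely used words. The function names and the `min_count` threshold are hypothetical.

```python
from collections import Counter

# Assumption: allowed characters are the Bengali block (U+0980-U+09FF)
# without the Bengali digits (U+09E6-U+09EF), plus the zero-width
# joiners that can appear inside Bengali words. A real project would
# list its alphabet explicitly.
BENGALI_DIGITS = {chr(c) for c in range(0x09E6, 0x09F0)}
ALPHABET = ({chr(c) for c in range(0x0980, 0x0A00)} - BENGALI_DIGITS) | {"\u200c", "\u200d"}

# Punctuation stripped from the edges of each token before checking it.
EDGE_PUNCTUATION = "।,;:?!\"'()-."


def word_frequencies(sentences, alphabet=ALPHABET):
    """Count words made up only of characters from the given alphabet."""
    counts = Counter()
    for sentence in sentences:
        for token in sentence.split():
            word = token.strip(EDGE_PUNCTUATION)
            # Tokens with digits or foreign letters are skipped entirely.
            if word and all(ch in alphabet for ch in word):
                counts[word] += 1
    return counts


def build_blocklist(counts, min_count=2):
    """Return words seen fewer than min_count times, sorted."""
    return sorted(word for word, n in counts.items() if n < min_count)
```

With this filter, a token such as “২০২০” never reaches the word statistics, which is one way the numbers could be kept out.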

mm.crjx · July 24, 2020, 8:23pm

Yeah, I was thinking the same. Let's see. Thanks anyway.

Oymate · August 26, 2020, 8:50am

So, any update on this issue?
I really need a Bangla dataset.

Oymate · December 10, 2020, 5:54am

Given the difficulty of extracting with a rules file, I think we should use public domain sentences (old books, newspapers) in the meantime.

mm.crjx · February 14, 2022, 1:39pm

@Oymate, @Shadman_Taqi we are organizing people for the data collection. Maybe you can join us on Discord or Matrix.

mm.crjx · February 21, 2022, 4:08pm

We at Common Voice for Bengali have successfully started organizing people to donate voice data, as is evident from the graph below.

However, recordings of longer sentences are timing out. What is the default time allotted for each sentence, and can we extend it?

bozden · February 21, 2022, 11:09pm

This has been answered in Matrix chat but those fly away, so here it is:

AFAIK recordings should be min 1.5s, max 14s. This is a global setting for CV, and changing it would affect all languages. So the solution is not there but in the sentence collector's language-based validator:

Limit word count (default 14 for English, which is also used for languages that do not have a specific validator).
Limit character count (usually 100-110 will work fine, but this is language dependent).
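
As a sketch only, the two limits could be checked like this in Python. This is not the sentence collector's own validator code; the function name and the default of 110 characters are assumptions taken from the numbers above.

```python
MAX_WORDS = 14   # word limit mentioned above for English
MAX_CHARS = 110  # assumed: upper end of the 100-110 range, language dependent


def is_valid_sentence(sentence, max_words=MAX_WORDS, max_chars=MAX_CHARS):
    """Return True if the sentence respects both the word and character limits."""
    text = sentence.strip()
    if not text:
        return False
    # len() counts Unicode code points, so Bengali vowel signs and other
    # combining marks each count as one character.
    return len(text.split()) <= max_words and len(text) <= max_chars
```

Because combining marks are counted separately, a character limit tuned for a Latin-script language will probably need adjusting for Bengali.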

You can time some sentences read by different people and calculate secs/word, secs/char, etc. to set the validator limits.
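
A small Python sketch of that calculation, under the assumption that the slowest observed reader should still fit into the time budget. The function name and the `safety` margin are hypothetical.

```python
import math


def estimate_limits(samples, max_seconds=14.0, safety=0.9):
    """Estimate (max_words, max_chars) from timed recordings.

    samples: list of (sentence, duration_in_seconds) pairs.
    """
    # Use the slowest rates seen, so slow speakers still fit the budget.
    secs_per_word = max(duration / len(text.split()) for text, duration in samples)
    secs_per_char = max(duration / len(text) for text, duration in samples)
    # Keep a margin below the hard recording limit.
    budget = max_seconds * safety
    return math.floor(budget / secs_per_word), math.floor(budget / secs_per_char)


if __name__ == "__main__":
    timed = [("aa bb cc dd", 2.0), ("aa bb cc", 1.8)]  # made-up timings
    print(estimate_limits(timed))
```

The timings in the usage example are invented placeholders. Real measurements from several speakers would go in their place.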

I tested it up to 135 chars (with a 14-word limit), and some people (those who speak with good accent/emphasis, elderly people, etc.) had problems fitting it in 14 sec.

Limiting sentences like this would shrink your text corpus, so you may want to pre-edit them, for example by dividing sentences at “:”. As these sentences should be CC0, you can do anything with them, but they should be correct, of course.
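
One possible way to do such a pre-edit, sketched in Python. The function name, the separator set and the 110-character threshold are assumptions, and the resulting fragments would still need a manual check for correctness.

```python
def split_long_sentence(sentence, max_chars=110, separators=":;"):
    """Split an over-long sentence after each separator character."""
    text = sentence.strip()
    if len(text) <= max_chars:
        return [text]  # short enough, leave it alone
    parts, current = [], ""
    for ch in text:
        current += ch
        if ch in separators:
            # Close the fragment, keeping the separator at its end.
            if current.strip():
                parts.append(current.strip())
            current = ""
    if current.strip():
        parts.append(current.strip())
    return parts
```

A fragment can still exceed the limit if the sentence has no separator near the middle, so the result should be run through the validator again.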

Another side note: very long recordings are usually not good for many voice-AI models. For example, Coqui limits the recording length to 10 seconds by default. So long sentences will be thrown away anyway…