New Language Support in Common Voice

Could you add Bengali to the Deep Speech languages? I think it is important!
With 228 million native speakers and 37 million second-language speakers, Bengali is the fifth most-spoken native language and the seventh most-spoken language by total number of speakers in the world. Ref: Bengali language

As it is my native language, I will be able to support you. Let me know how I can help in the process.

Bengali is already in progress for localization

https://pontoon.mozilla.org/bn/common-voice/

I see there are only 598 sentences from our sentence collector and no existing effort to extract Wikipedia sentences (this should come first).

Please see this topic for reference:

@nukeador I have started extracting sentences from the wiki dump. The repo is here. I need some help figuring out the language rules and the blacklist. I think the issue is that common-voice-wiki-scraper treats "." as a sentence terminator, which is not the case in Bengali.

Even the Bengali Common Voice website has issues. How do I reach the translators? Should I open issues in voice-web?

This may be asking for too much, but a Riot channel for the contributors (like the other Common Voice communities have) would help this effort a lot. How do I go about that?

You can ask about the rules in the sentence extractor topic or in the Matrix room about the tool.

You can reach the Bengali localizers and reviewers here:

https://pontoon.mozilla.org/bn/common-voice/contributors/

I have made some progress on scraping bn.wiki: a more or less complete rules file (bn.toml) has been created with help from @mkohler. However, it is only generating 410 to 420 sentences from 90k articles on Bengali Wikipedia, which seems low compared to the 4.5k sentences out of 139k articles on Hindi Wikipedia. Maybe it's an issue with the sentence tokeniser.

There is also an issue with blocklist creation, because cvtools is hardcoded for the Roman alphabet. A word statistics and blocklist generation tool will be particularly useful for scraping more public domain sources. Can anyone suggest an alternative tool? I am sure there are plenty, and it wouldn't be productive to modify cvtools if a good alternative already exists.
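In case it helps while looking for a tool, here is a rough Python sketch of script-independent word statistics. It is not how cvtools works; the function names `tokenize` and `rare_words` and the `min_count` threshold are made up for illustration. It decides what belongs to a word by Unicode category rather than by an a-z range, so Bengali vowel signs (which are combining marks) stay attached to their consonants.

```python
import unicodedata
from collections import Counter

# Zero-width joiner / non-joiner can appear inside Bengali words (assumption:
# we want to keep them as part of the word rather than split on them).
JOINERS = "\u200c\u200d"


def tokenize(text):
    """Split text into words made of letters (L*), marks (M*) and digits (N*)."""
    words, current = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in "LMN" or ch in JOINERS:
            current.append(ch)
        elif current:
            # Any other character (space, danda, punctuation) ends the word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


def rare_words(sentences, min_count=2):
    """Return the sorted words that occur fewer than min_count times overall."""
    counts = Counter(word for s in sentences for word in tokenize(s))
    return sorted(word for word, n in counts.items() if n < min_count)


if __name__ == "__main__":
    sample = ["আমি ভাত খাই।", "আমি গান গাই।"]
    print(rare_words(sample))
```

The output of `rare_words` could then be reviewed by hand before being used as a blocklist; the right threshold depends on the corpus size.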

All the files related to the progress can be found in the repo mentioned earlier.

@dabinat might have some thoughts here :)

https://github.com/Common-Voice/cv-sentence-extractor/blob/master/src/extractor.rs#L127

It basically falls back to the English tokenizer, which is why it only picks up sentences ending with “.”.
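To illustrate the difference, here is a minimal Python sketch of splitting on the danda instead of the full stop. This is not the extractor's Rust code, and the terminator set (danda U+0964, double danda U+0965, "?" and "!") is my assumption about what a Bengali rule would need. A "." is deliberately not treated as a terminator, so abbreviations are left alone.

```python
import re

# Split on whitespace that follows a danda, double danda, "?" or "!".
# The terminator set is an assumption for illustration only.
_BOUNDARY = re.compile(r"(?<=[\u0964\u0965?!])\s+")


def split_sentences(text):
    """Return the sentences in text, each keeping its closing terminator."""
    return [part.strip() for part in _BOUNDARY.split(text) if part.strip()]


if __name__ == "__main__":
    sample = "আমি ভাত খাই। তুমি কি খাও? ড. রহমান এসেছেন।"
    for sentence in split_sentences(sample):
        print(sentence)
```

A limitation of this sketch is that it only splits where whitespace follows the terminator, so two sentences run together without a space stay joined.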

@nukeador I will take a look at this over the weekend.