New Language Support in Common Voice

Can you add Bengali to the Deep Speech languages? I think it is important!
With 228 million native speakers and 37 million second-language speakers, Bengali is the fifth most-spoken native language and the seventh most-spoken language by total number of speakers in the world. Ref: Bengali language

As it is my native language, I will be able to support you. Let me know how I can help in the process.

Bengali is already in progress for localization.

I see there are only 598 sentences from our sentence collector and no existing effort to get Wikipedia sentences (this should come first).

Please see this topic for reference:


@nukeador I have started extracting sentences from the wiki dump. The repo is here. I need some help figuring out the language rules and the blacklist. I think the issue is that common-voice-wiki-scraper treats "." as a sentence terminator, which is not the case in Bengali. Even the Bengali Common Voice website has issues. How do I reach the translators? Should I open issues in voice-web? Maybe this is asking too much, but a Riot channel (like the other Common Voice communities have) for contributors would help this effort a lot. How do I go about that?

You can ask about the rules in the sentence extractor topic or in the Matrix room about the tool.

You can reach out to Bengali localizers and reviewers from

I have made some progress on scraping: a more or less complete rules file (bn.toml) has been created with help from @mkohler. However, it is only generating 410-20 sentences from 90k articles on Bengali Wikipedia, which seems low compared to the 4.5k sentences out of 139k articles on Hindi Wikipedia. Maybe it's an issue with the sentence tokeniser.

Nevertheless, there is another issue with blocklist creation, because cvtools is hardcoded for the Roman alphabet. A word statistics / blacklist generation tool like this will be particularly useful for scraping more public domain sources. Can anyone suggest an alternative tool? I am sure there are plenty. It wouldn't be productive to modify cvtools if a good alternative already exists.
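
In case it is useful while looking for a tool: a rough Python sketch of script-aware word statistics and a rare-word blocklist. This is not cvtools and does not reproduce its output format. The names `WORD_RE`, `word_counts`, `build_blocklist` and the `min_count` threshold are hypothetical, and treating a "word" as a run of Bengali letters and signs (U+0980 to U+09E3, plus U+09F0, U+09F1 and the zero-width joiners) is an assumption that may need tuning.

```python
import re
from collections import Counter

# Assumption: a word is a run of Bengali letters/signs. Bengali digits
# (U+09E6 to U+09EF), Latin text, ASCII digits and punctuation are left
# out of the character class, so they act as separators.
WORD_RE = re.compile(r"[\u0980-\u09E3\u09F0\u09F1\u200C\u200D]+")


def word_counts(sentences):
    """Count how often each Bengali word occurs across the sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(WORD_RE.findall(sentence))
    return counts


def build_blocklist(counts, min_count=2):
    """Return the words seen fewer than min_count times, sorted."""
    return sorted(word for word, n in counts.items() if n < min_count)


if __name__ == "__main__":
    sample = ["আমি ভাত খাই।", "আমি 2020 সালে abc ভাত খাই"]
    counts = word_counts(sample)
    print(counts.most_common())
    print(build_blocklist(counts))  # words that occur only once
```

On a real dump the threshold would be much higher than 2; it is kept small here only so the sample shows something.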

All the files related to the progress can be found in the repo mentioned earlier.

@dabinat might have some thoughts here :slight_smile:

It basically falls back to the English tokenizer, which is why it only picks up sentences ending with “.”.
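
For context, Bengali sentences normally end with the danda (।, U+0964) rather than a Latin full stop. Below is a minimal Python sketch of splitting on it. It is not the code used by common-voice-wiki-scraper; the function name `split_bengali_sentences` and the choice of terminators (danda, "?" and "!") are assumptions for illustration only.

```python
import re

# Assumption: a sentence ends with the danda (U+0964), "?" or "!".
# The Latin full stop is deliberately not a terminator, so abbreviations
# written with "." do not split a sentence.
_TERMINATORS = "\u0964?!"
_SENTENCE_RE = re.compile(rf"[^{_TERMINATORS}]+[{_TERMINATORS}]+")


def split_bengali_sentences(text):
    """Return the terminated sentences in text, stripped of outer whitespace.

    Trailing text without a terminator is dropped.
    """
    return [m.group().strip() for m in _SENTENCE_RE.finditer(text)]


if __name__ == "__main__":
    text = "আমি ভাত খাই। তুমি কেমন আছো? ড. রহমান এসেছেন।"
    for sentence in split_bengali_sentences(text):
        print(sentence)
```

With the sample text this yields three sentences, and the "." inside the third one does not cause a split.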

@nukeador I will take a look at this over the weekend.

@mm.crjx Can you try the new “more-characters” branch of cvtools and see if it solves the problem for you?

@dabinat Sorry for the delay, I was busy with an exam. It does work, although the output has numbers in it; I have to find a way to eliminate those.

@mm.crjx OK, I will look into this. The best solution may be to allow users to specify an alphabet, like DeepSpeech does.
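
A rough Python sketch of what a user-specified alphabet could look like: only words made entirely of characters from the supplied alphabet are kept, which would also drop the numbers. This is an illustration, not how cvtools or DeepSpeech implement it; `make_word_filter` and `BENGALI_LETTERS` are hypothetical names, and the code point range used for the alphabet is an assumption.

```python
def make_word_filter(alphabet):
    """Build a predicate that accepts words made only of alphabet characters."""
    allowed = frozenset(alphabet)

    def is_allowed(word):
        # Reject empty strings and any word with a character outside the alphabet.
        return bool(word) and all(ch in allowed for ch in word)

    return is_allowed


# Assumption: Bengali letters and vowel signs, U+0980 to U+09E3.
# Bengali digits start at U+09E6, so they fall outside this alphabet.
BENGALI_LETTERS = [chr(cp) for cp in range(0x0980, 0x09E4)]

if __name__ == "__main__":
    is_bengali_word = make_word_filter(BENGALI_LETTERS)
    words = ["আমি", "২০২০", "2020", "ভাত"]
    print([w for w in words if is_bengali_word(w)])  # ['আমি', 'ভাত']
```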

Yeah, I was thinking the same. Let's see, thanks anyway. :sweat_smile:

So, any update on this issue?
I really need a Bangla dataset.

Given the difficulty of extracting sentences with a rules file, I think we should use public domain sentences (old books, newspapers) in the meantime.