Can you add Bengali to the DeepSpeech languages? I think it is important! With 228 million native speakers and 37 million second-language speakers, Bengali is the fifth most-spoken native language and the seventh most-spoken language by total number of speakers in the world. Ref: Bengali language A…

Bengali is already in progress for localization (Common Voice · Bengali (bn), Mozilla’s Localization Platform). I see there are only 598 sentences from our sentence collector and no existing efforts to get Wikipedia sentences (this should come first). Please s…

@nukeador I have started extracting sentences from the wiki dump. The repo is here. I would need some help figuring out the language rules and blacklist. I think the issue is that common-voice-wiki-scraper treats “.” as a sentence terminator, which is not the case in Bengali. Even the b…

You can ask about the rules on the sentence extractor topic or in the Matrix room about the tool. You can reach out to Bengali localizers and reviewers via https://pontoon.mozilla.org/bn/common-voice/contributors/

I have made some progress scraping bn.wiki: a more or less complete rules file (bn.toml) has been created with help from @mkohler. However, it is only generating 410 to 420 sentences from 90k articles on Bengali Wikipedia, which seems low compared to the 4.5k sentences out of 139k articles on …

@dabinat might have some thoughts here :slight_smile:

https://github.com/Common-Voice/cv-sentence-extractor/blob/master/src/extractor.rs#L127 It basically falls back to the English tokenizer, which is why it only picks up sentences ending in “.”.
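To illustrate the difference, here is a minimal Python sketch of splitting on the Bengali danda (“।”) instead of “.”. This is not the extractor's code (that is Rust); the function name `split_sentences` and the choice of terminators are assumptions made for the example.

```python
import re

# Sentence-final marks: the Bengali danda plus question and exclamation marks.
# Assumption: "." is deliberately left out, so abbreviations such as "ডা."
# do not end a sentence.
TERMINATORS = "।?!"


def split_sentences(text: str) -> list[str]:
    """Split text after each terminator, keeping the terminator attached."""
    # Each match is a run of non-terminator characters, optionally followed
    # by one terminator. A trailing fragment without a terminator is kept.
    pieces = re.findall(rf"[^{TERMINATORS}]+[{TERMINATORS}]?", text)
    return [p.strip() for p in pieces if p.strip()]


if __name__ == "__main__":
    for sentence in split_sentences("আমি ভাত খাই। তুমি কেমন আছো?"):
        print(sentence)
```

A tokenizer that only knows “.” would return the whole input above as one piece, which matches the low sentence counts reported earlier in the thread.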

@nukeador I will take a look at this over the weekend.

@mm.crjx Can you try the new “more-characters” branch of cvtools and see if it solves the problem for you?

@dabinat Sorry for the delay, I was busy with an exam. It does work, although the output still has numbers in it; I have to find a way to eliminate those.

@mm.crjx Ok, I will look into this. The best solution may be to allow users to specify an alphabet like DeepSpeech.
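As a rough idea of what an alphabet-based filter could look like, here is a minimal Python sketch. It is not how cvtools or DeepSpeech implement this; the `ALLOWED` set and the function names are hypothetical, and the exact character ranges would need review by Bengali speakers.

```python
# Hypothetical alphabet: Bengali signs, letters, and vowel marks
# (U+0980 to U+09E3), plus the danda, a space, basic punctuation, and the
# zero-width joiners. Bengali digits (U+09E6 to U+09EF) and ASCII digits
# are left out on purpose, so sentences containing numbers are rejected.
ALLOWED = {chr(cp) for cp in range(0x0980, 0x09E4)} | set(" ।,?!-\u200c\u200d")


def in_alphabet(sentence: str) -> bool:
    """Return True when every character of the sentence is in ALLOWED."""
    return all(ch in ALLOWED for ch in sentence)


def filter_sentences(sentences: list[str]) -> list[str]:
    """Keep only the sentences written entirely in the allowed alphabet."""
    return [s for s in sentences if in_alphabet(s)]


if __name__ == "__main__":
    candidates = ["আমি ভাত খাই।", "২০২০ সালে", "1971 সালে"]
    print(filter_sentences(candidates))
```

A whitelist like this drops a sentence as soon as it sees any character outside the alphabet, so it removes both Bengali and ASCII digits without needing a separate rule for each.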

New Language Support in Common Voice

nukeador (Rubén Martín [❌ taking a break from Mozilla]) April 30, 2020, 11:10am 2

