📖 Readme: How to see my language on Common Voice

That’s awesome to hear! Atm I still have to run a script & redeploy to make a language available. I’ll try to get that done this week.

Hello! May I translate this announcement into Interlingua? Thanks. You can reply in English, Italian, Spanish, French, or Portuguese. Thanks.



Since the bn-Wikipedia scraping is on hold because of the rust-punkt issue, I would like to know in the meantime about the legality of using sentences from other public domain sources, mainly for the Bengali language, such as:


which may be a bit problematic legally, but the quality of the data is good in the sense that it's scraped from the web and hence contains actual colloquial language. This corpus also contains data for most languages.

The other is
https://cse.iitkgp.ac.in/~pabitra/shruti_corpus.html

This is another CC0 dataset, collected by a university to help evaluate automatic speech recognition systems. Hence it could be a good validation dataset.

OSCAR was actually already suggested in the past, but for legal reasons a decision was made not to use it. See “Using OSCAR corpus sentences”. I would have to invest a bit more time in checking the second source you provided before I could give any valid input on it, but at first glance it looks more like an already complete, unrelated voice dataset than a great source of sentences for Common Voice.

Yeah, it’s an already completed dataset. Can’t it be used as a test dataset?

Sure it can, but it should not become part of Common Voice. You can link to it from Common Voice, but including the data in Common Voice would only cause confusion and potentially data duplication for multi-dataset applications.

@Adrijaned Is Wikisource a good source? I see that public domain means different things in different countries. Can you clarify whether text that is public domain in the US can be used for Common Voice? This could be a good source: https://bn.wikisource.org/wiki/লেখক:রবীন্দ্রনাথ_ঠাকুর but adding the sentences manually through the Sentence Collector would be a huge amount of work :confused:

Wikisource can be a good source, and I agree that manually pulling sentences from there is tedious (been there, done that). If you could make a script that goes through Wikisource and extracts viable sentences, one per line, into a new file, then those could in theory go through the same validation process as Wikipedia extractions. If you can do that, go ahead and we will sort something out later. Otherwise, writing such a script will take a fair bit of effort and time from other people involved, so it could very well be a long while until that gets done. Have you thought about getting a Wikipedia extraction for Bengali done in the meantime instead? That one is somewhat functional and should be significantly less effort. (https://github.com/Common-Voice/cv-sentence-extractor)
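For anyone who wants to try this, here is a rough sketch in Python of what "extract viable sentences, one per line, into a new file" could look like. It assumes you already have the public domain text as a plain string, and that sentences end with the Bengali danda (।), a question mark, or an exclamation mark. The word limits, the allowed-character set, and the names `extract_sentences` and `write_sentences` are placeholders of mine, not Common Voice rules and not part of cv-sentence-extractor, so treat it as a starting point only.

```python
import re
from pathlib import Path

# Assumption: a sentence ends with the danda "।", "?" or "!" followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[।?!])\s+")

# Assumption: keep only Bengali-block characters, whitespace and basic punctuation.
ALLOWED = re.compile(r"^[\u0980-\u09FF\s,;:\-।?!]+$")

# Bengali digits; skipped here because numbers can be read aloud in several ways.
BENGALI_DIGITS = re.compile(r"[\u09E6-\u09EF]")

# Hypothetical length limits, not official values.
MIN_WORDS, MAX_WORDS = 3, 14


def extract_sentences(text: str) -> list[str]:
    """Return unique candidate sentences from raw text, in order of appearance."""
    normalized = " ".join(text.split())  # collapse newlines and repeated spaces
    seen: set[str] = set()
    result: list[str] = []
    for candidate in SENTENCE_END.split(normalized):
        sentence = candidate.strip()
        if not sentence or sentence in seen:
            continue
        if not sentence.endswith(("।", "?", "!")):
            continue  # trailing fragment without a terminator
        if not ALLOWED.match(sentence) or BENGALI_DIGITS.search(sentence):
            continue  # foreign script, symbols or digits
        if not MIN_WORDS <= len(sentence.split()) <= MAX_WORDS:
            continue
        seen.add(sentence)
        result.append(sentence)
    return result


def write_sentences(sentences: list[str], out_path: str) -> None:
    """Write one sentence per line as UTF-8."""
    Path(out_path).write_text("\n".join(sentences) + "\n", encoding="utf-8")
```

You would call `extract_sentences(raw_text)` on each page's text and pass the combined list to `write_sentences(sentences, "bn-sentences.txt")` (the file name is just an example). The output would still need human review before any bulk submission.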

Let me try out the wiki extractor once more; it seems there is a fix for the sentence terminator. I will let you know the results in a couple of days.

As you can see from the discussions here, the issue is that the rust-punkt tokenizer is not even trained for languages like Bengali; it falls back to English for tokenizing, as explained by @Arijit_Mukherjee before. Until and unless a new tokenizer with wider support is introduced, Wikipedia extraction for most languages seems like a futile exercise. I don’t think adding a choice of punctuation character will solve all the problems.

I did all of that back in 2018 and am still waiting for a response from the moderators.


Actually, we have a lot of books from really famous writers (read: Nobel laureates) that are in the public domain. As I explained before, their works are made public by the Indian govt 60 years after their death. Those works are available on archive.org. Can you do a bulk dump after going through the statistical quality check guidelines? And @heyhillary, can you make a separate readme thread for Bangla? Otherwise this one is gonna get geared towards a single language.

Yeah, I can. Would you be happy to be the moderator for the Bangla sub-thread on Discourse?

Yes, that would be great! Thank you. And if you have any moderator guidelines, please let me know.

Hey Mainak, I’ve requested a Bengali/Bangal subthread on Discourse and will update you when it’s available.

This information has also been added to the About page of the Common Voice platform. Bengali isn’t fully localised: https://commonvoice.mozilla.org/bn/about

If you would be happy to localise this page, please create a Pontoon account and then localise via this link: https://pontoon.mozilla.org/bn/common-voice/

If you have any questions please let me know.

Thanks @heyhillary for all the support! You can add Tahmid Hossein and আফতাবুজ্জামান as moderators on the Pontoon platform. They work with wiki for translation and have a very good grasp of the language. If you look in the Matrix chat, I have their written consent, though it might be in Bengali. Another minor thing: the language is called ‘Bangla’, and Bengal is the region it’s spoken in. Thanks again!

I would also need some clarity on the issue of public domain books. Most of the previous discussion on this was with @nukeador, and I am not sure if he is around anymore. The thing is, for the Bangla language we have already collected 350+ hrs of voice with just about 1 lakh 25 thousand (125,000) sentences, which is about three times what is recommended. We desperately need new CC0 sentences, as we wouldn’t like to kill the enthusiasm of the volunteers by saying “you have to wait until we have more sentences”. @mkohler @heyhillary, if anyone can provide any direction, that would be of great help! Also, sorry for bothering you during the weekend, and happy Easter in advance!


Hey Mainak,

I would like to suggest trying one of the following ideas:

  1. Host a community sentence collection campaign as a way to show how members can generate and highlight CC0 sentences. You could get funding support via the Mozilla Reps.

  2. Reach out to Bangla-language news outlets or cultural institutions and ask them to donate sentences from their online archives under CC0. You can use the CC0 waiver process for this.

  3. Although this is not official advice, some communities have translated and adapted corpus texts from languages similar to theirs. They used automatic translation, had reviewers check the sentences, and then added them as a bulk submission.

Sorry for the delay in response - I have been on PTO.
