📖 Readme: How to see my language on Common Voice

Hi @jumasheff, I consulted with our legal team to make sure I was giving you the best answer, and they let me know that CC0 is a dedication to the public domain (to the maximum extent possible). In fact, the license is called “Public Domain Dedication.” Let me know if you have any specific questions about this or would like further clarification.


I want to note that we have just launched the sentence collection tool

In Spanish we have collected enough sentences and reviewed them and translated the website, but the language is still not enabled for voice recording. When will Spanish be enabled for voice recording?

That’s awesome to hear! Atm I still have to run a script & redeploy to make a language available. I’ll try to get that done this week.

Greetings! May I translate this announcement into Interlingua? Thank you. You can reply in English, Italian, Spanish, French or Portuguese. Thank you.



The bn-Wikipedia scraping is on hold because of the rust-punkt issue. In the meantime, I would like to know about the legality of using sentences from other public domain sources, mainly for the Bengali language, such as:


which may be a bit problematic legally, but the quality of the data is good in the sense that it’s scraped from the web and hence contains actual colloquial language. This corpus also contains data for most languages.

The other is
https://cse.iitkgp.ac.in/~pabitra/shruti_corpus.html

This is another CC0 dataset, collected by a university to help evaluate automatic speech recognition systems, so it could be a good validation dataset.

OSCAR was already suggested in the past, but for legal reasons a decision was made not to use it. See “Using OSCAR corpus sentences”. I would have to invest a bit more time in checking the second source you provided before I could give valid input on it, but at first glance it looks like a complete, unrelated voice dataset rather than a good source of sentences for Common Voice.

Yeah, it’s an already completed dataset. Can’t it be used as a test dataset?

Sure it can, but it should not become part of Common Voice. You can link to it from Common Voice, but including the data in Common Voice would only cause confusion and potentially data duplication for multi-dataset applications.

@Adrijaned Is Wikisource a good source? I see that public domain means different things in different countries. Can you clarify whether text that is public domain in the US can be used for Common Voice? This could be a good source: https://bn.wikisource.org/wiki/লেখক:রবীন্দ্রনাথ_ঠাকুর but adding sentences manually through the sentence collector would be a huge amount of work :confused:

Wikisource can be a good source, and I agree that manually pulling sentences from there is tedious (been there, done that). If you can write a script that goes through Wikisource and extracts viable sentences, one per line, into a new file, then those could in theory go through the same validation process as Wikipedia extractions, so go ahead and we will sort something out later. Otherwise, writing such a script will take a fair bit of effort and time from the other people involved, so it could well be a long while until that gets done. Have you thought about getting a Wikipedia extraction for Bengali done in the meantime instead? That one is somewhat functional and should be significantly less effort. (https://github.com/Common-Voice/cv-sentence-extractor)
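For anyone who wants to try this, here is a rough sketch of the "one viable sentence per line" step in Python. It is only an illustration and not how cv-sentence-extractor works: the function names (`extract_sentences`, `write_sentences`), the word limits, and the allowed punctuation are all my own assumptions rather than official Common Voice rules, and downloading the text from Wikisource is left out entirely.

```python
import re

# Split after a Bengali danda, question mark or exclamation mark.
# (Assumption: these are the only sentence terminators we care about.)
SPLIT_RE = re.compile(r"(?<=[।?!])\s+")

# Keep only sentences made of Bengali-block characters, spaces and a little
# punctuation, ending in one terminator. The character set is an assumption.
ALLOWED_RE = re.compile(r"[\u0980-\u09FF\s,;'\"\-]+[।?!]")

# Bengali digits, rejected here because numbers are ambiguous to read aloud.
DIGIT_RE = re.compile(r"[\u09E6-\u09EF]")


def extract_sentences(text, min_words=3, max_words=14):
    """Return unique candidate sentences in their original order.

    min_words and max_words are illustrative defaults, not official limits.
    """
    sentences = []
    seen = set()
    for candidate in SPLIT_RE.split(text):
        candidate = " ".join(candidate.split())  # normalise whitespace
        if not candidate or candidate in seen:
            continue
        if not ALLOWED_RE.fullmatch(candidate):
            continue  # foreign script, odd symbols, or no terminator
        if DIGIT_RE.search(candidate):
            continue
        if min_words <= len(candidate.split()) <= max_words:
            seen.add(candidate)
            sentences.append(candidate)
    return sentences


def write_sentences(sentences, path):
    """Write one sentence per line to a UTF-8 text file."""
    with open(path, "w", encoding="utf-8") as handle:
        for sentence in sentences:
            handle.write(sentence + "\n")
```

Calling `write_sentences(extract_sentences(raw_text), "bn-wikisource.txt")` on text you have already downloaded would give a file that can then be reviewed by hand. The filename is just an example, and the filters will need tuning against real Wikisource pages.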

Let me try out the wiki extractor once more; it seems there is a fix for the sentence terminator. I will let you know the results in a couple of days.

As you can see from the discussions here, the issue is that the rust-punkt tokenizer is not even trained for languages like Bengali; it falls back to English for tokenizing, as explained by @Arijit_Mukherjee before. Until a new tokenizer with wider support is introduced, Wikipedia extraction for most languages seems like a futile exercise. I don’t think adding a choice of punctuation character will solve all problems.

Did all of that back in 2018 and still waiting to get a response from the moderators.


Actually, we have a lot of books from really famous writers (read: Nobel laureates) that are in the public domain. As I explained before, their works are made public by the Indian govt 60 years after their death. Those works are available on archive.org. Can you do a bulk dump after going through the statistical quality check guidelines? And @heyhillary, can you make a separate readme thread for Bangla? Otherwise this one is going to get geared towards one language.

Yeah, I can. Would you be happy to be the moderator for the Bangla sub-thread on Discourse?

Yes, that would be great! Thank you. And if you have any moderator guidelines, please let me know.

Hey Mainak, I’ve requested a Bengali/Bangla sub-thread on Discourse and will update you when it’s available.

This information has also been added to the About page on the Common Voice platform. Bengali isn’t fully localised: https://commonvoice.mozilla.org/bn/about

If you would be happy to localise this page, please create a Pontoon account, then localise via this link: https://pontoon.mozilla.org/bn/common-voice/

If you have any questions please let me know.