📖 Readme: How to see my language on Common Voice

Most languages that have reached the number of sentences needed for the voice phase (more than 5,000) are the ones that provided that many strings during the campaign we ran a few months ago. Unfortunately, Spanish did not gather many (just a few hundred).

Other languages were contributing through GitHub pull requests, but we have found that some of those contributions will need a cleanup in order to be useful for the machine learning engine. That’s why, from now on, we want to make sure the sentences we include are properly reviewed, to avoid having to do another cleanup in the future.

The good news is that this should not stop any sentence-collection efforts: in early January, as soon as we have the tool ready, we will be able to mass import the sentences and use the tool for proper review and approval.

A post was split to a new topic: Multilanguage site localization

Sorry for nitpicking, but CC0 is not Public Domain, according to the Creative Commons wiki:
https://wiki.creativecommons.org/wiki/CC0_FAQ
I think it would be more accurate to say that the project accepts sentences from CC0 and Public Domain sources.


A post was split to a new topic: Can a folk tales compilation/book be a valid source?

Hi @jumasheff, I consulted with our legal team to make sure I was giving you the best answer, and they let me know that CC0 is a dedication to the public domain (to the maximum extent possible). In fact, the license is called “Public Domain Dedication.” Let me know if you have any specific questions about this or would like further clarification.


I want to note that we have just launched the sentence collection tool.

For Spanish, we have collected enough sentences, reviewed them, and translated the website, but the language is still not enabled for voice recording. When will Spanish be enabled for voice recording?

That’s awesome to hear! At the moment I still have to run a script and redeploy to make a language available. I’ll try to get that done this week.

Greetings! May I translate this communication into Interlingua? Thank you. You can reply in English, Italian, Spanish, French or Portuguese. Thank you.


A post was split to a new topic: Add tamazigh language

2 posts were merged into an existing topic: Hindi locale

Since the bn-Wikipedia scraping is on hold because of the rust-punkt issue, I would like to know in the meantime about the legality of using sentences from other public domain sources, mainly for the Bengali language, such as:


The first is the OSCAR corpus, which may be a bit problematic legally, but the quality of the data is good in the sense that it is scraped from the web and hence contains actual colloquial language. This corpus contains data for most languages.

The other is
https://cse.iitkgp.ac.in/~pabitra/shruti_corpus.html

This is another CC0 dataset, collected by a university to help evaluate automatic speech recognition systems, so it could be a good validation dataset.

OSCAR was actually already suggested in the past, but for legal reasons a decision was made not to use it; see Using OSCAR corpus sentences. I would have to invest a bit more time into checking the second source you provided before I can give valid input on it, but at first glance it looks more like an already complete, unrelated voice dataset than a great source of sentences for Common Voice.

Yeah, it’s an already completed dataset. Can’t it be used as a test dataset?

Sure it can, but it should not become part of Common Voice. You can link to it from Common Voice, but including the data in Common Voice would only cause confusion and potentially data duplication for multi-dataset applications.

@Adrijaned Is WikiSource a good source? I see that public domain means different things in different countries. Can you clarify whether text that is public domain in the US can be used for Common Voice? This could be a good source, https://bn.wikisource.org/wiki/লেখক:রবীন্দ্রনাথ_ঠাকুর, but adding the sentences manually through the Sentence Collector would be a huge amount of work :confused:

WikiSource can be a good source, and I agree that manually pulling sentences from there is tedious (been there, done that). If you could write a script that goes through WikiSource and extracts viable sentences, one per line, into a new file, then those could in theory go through the same validation process as the Wikipedia extractions, so go ahead and we will sort something out later. Otherwise, writing such a script would take a fair bit of effort and time from other people involved, so it could very well be a long while before it gets done. Have you thought about getting a Wikipedia extraction for Bengali done in the meantime instead? That one is somewhat functional and should be significantly less effort. (https://github.com/Common-Voice/cv-sentence-extractor)
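To make the idea a bit more concrete, here is a rough sketch (my own, not an official tool) of what such a script could look like: it pulls raw wikitext for a handful of pages through the standard MediaWiki API, strips the most common markup, splits on the danda and writes one candidate sentence per line. The page title, the splitting rule and the filters are all placeholder assumptions, and the output would of course still need the normal review.

```python
#!/usr/bin/env python3
"""Rough sketch of a WikiSource sentence extractor for Bengali.

Assumptions not from this thread: the page title below is a placeholder,
the danda (।) is treated as the sentence terminator, and the length and
character filters are stand-ins for whatever rules reviewers agree on."""

import json
import re
import urllib.parse
import urllib.request

WIKISOURCE_API = "https://bn.wikisource.org/w/api.php"
PAGES = ["উদাহরণ_পাতা"]  # hypothetical page titles to pull text from


def fetch_wikitext(title: str) -> str:
    """Fetch raw wikitext for one page via the standard MediaWiki API."""
    params = urllib.parse.urlencode({
        "action": "parse",
        "page": title,
        "prop": "wikitext",
        "format": "json",
        "formatversion": "2",
    })
    req = urllib.request.Request(
        f"{WIKISOURCE_API}?{params}",
        headers={"User-Agent": "cv-sentence-sketch/0.1 (example script)"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["parse"]["wikitext"]


def strip_markup(wikitext: str) -> str:
    """Very crude wiki-markup removal; good enough for a first pass."""
    text = re.sub(r"\{\{.*?\}\}", " ", wikitext, flags=re.S)       # templates
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)  # links
    text = re.sub(r"<[^>]+>", " ", text)                           # HTML tags
    text = re.sub(r"'{2,}", "", text)                              # bold/italic
    return text


def candidate_sentences(text: str):
    """Split on the danda and keep only plausible, clean sentences."""
    for raw in text.split("।"):
        sentence = " ".join(raw.split())
        words = sentence.split()
        # 3-14 words, no digits or leftover markup characters.
        if 3 <= len(words) <= 14 and not re.search(r"[0-9={}\[\]|]", sentence):
            yield sentence + "।"


if __name__ == "__main__":
    with open("bn_wikisource_sentences.txt", "w", encoding="utf-8") as out:
        for title in PAGES:
            text = strip_markup(fetch_wikitext(title))
            for sentence in candidate_sentences(text):
                out.write(sentence + "\n")
```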

Let me try out the wiki extractor once more; it seems like there is a fix for the sentence terminator. I will let you know the results in a couple of days.

As you can see from the discussions here, the issue is that the rust-punkt tokenizer is not even trained for languages like Bengali; it falls back to English for tokenizing, as explained by @Arijit_Mukherjee before. Until a new tokenizer with wider support is introduced, Wikipedia extraction for most of these languages seems like a futile exercise. I don’t think the addition of a choice of punctuation character will solve all problems.
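To illustrate the most basic symptom of that fallback (a toy example of my own, not code from the extractor): a splitter that only knows the English terminators never finds a sentence boundary in Bengali text, while one that also knows the danda does. A trained tokenizer handles much more than terminators (abbreviations, ellipses, quotes), which is why simply adding a punctuation character is unlikely to fix everything.

```python
import re

# Two sentences from a well-known Bengali song, each ending in the danda (।).
text = "আমি বাংলায় গান গাই। আমি বাংলার গান গাই।"

# English-style terminators only: no boundary is ever found.
english_style = re.split(r"(?<=[.!?])\s+", text)

# Adding the danda to the terminator set finds both sentences.
bengali_aware = re.split(r"(?<=[.!?।])\s+", text)

print(len(english_style))  # 1 -- the whole passage comes back as one "sentence"
print(len(bengali_aware))  # 2
```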

I did all of that back in 2018 and am still waiting for a response from the moderators.
