Since the Bulgarian website is already online and we’ll start work on the sentence collection, one more question: which licenses are allowed? I’ve read in some other places about CC0 and public domain. I guess that includes sentences authored by a submitter willing to license them under one of those two options. Is attribution (author or source information) required?
Yes, the license needs to be Public Domain (CC-0), and we ask about the source so we can check it.
Thanks!
Some other suggestions, since I worked on adding the Italian language with the Mozilla Italia community:
- At least 7000 sentences
- No bad words, because the website is open to everyone and there are no regulations
- Check that each sentence can be read within 20 seconds, which is the maximum recording length
- Validate the sentences against the grammar rules of your language
- Take them from different contexts: books, school material (for example, people donate their theses), religion, IT, novels, interviews, etc.
- Review them, and set up a workflow in which no single person blocks the others
PS: no Wikipedia, to avoid licensing issues and to be free to modify the sentences as you need.
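A few of the checks in the list above could be automated before a human review. The following is only a minimal sketch, not the project's own tooling: the 7000-sentence minimum and the 20-second limit come from the suggestions above, while the reading pace (`WORDS_PER_SECOND`), the blocklist passed as `bad_words`, and all function names are assumptions made for illustration.

```python
import re

MIN_SENTENCES = 7000      # minimum suggested in the list above
MAX_SECONDS = 20.0        # maximum recording length mentioned above
WORDS_PER_SECOND = 2.0    # assumption: a slow reading pace; tune per language


def estimated_seconds(sentence):
    """Rough reading time, assuming WORDS_PER_SECOND words per second."""
    return len(sentence.split()) / WORDS_PER_SECOND


def contains_bad_word(sentence, bad_words):
    """True if any whole word of the sentence is in the (hypothetical) blocklist."""
    words = re.findall(r"\w+", sentence.lower())
    return any(word in bad_words for word in words)


def filter_sentences(sentences, bad_words):
    """Keep non-empty sentences that pass the bad-word and length checks."""
    kept = []
    for raw in sentences:
        sentence = raw.strip()
        if not sentence:
            continue
        if contains_bad_word(sentence, bad_words):
            continue
        if estimated_seconds(sentence) > MAX_SECONDS:
            continue
        kept.append(sentence)
    return kept


def has_enough(sentences):
    """True once the collection reaches the suggested minimum size."""
    return len(sentences) >= MIN_SENTENCES
```

A word-count estimate is only a proxy for recording time, so borderline sentences still deserve a manual read-through, and grammar validation stays a job for human reviewers.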
Here’s a blog post on how we added a new language to Common Voice, from start to finish.
http://jrmeyer.github.io/misc/2019/05/29/mozilla-kyrgyz-common-voice.html
Hi,
Is step 3 still blocked? I have plenty of Spanish sentences and I would like to upload them.
The sources are public domain books of Spanish-language literature, from which I selected mainly dialogue phrases.
On the other hand, are there other requirements regarding the length of the sentences, grouping them into categories, the types of words in the phrases, named entities, numbers, dates, the use of slang, or anything else?
Best regards,
Mar
Ideally we would have a beta version of the sentence collection tool to test before the end of the year. If the QA of that beta is satisfactory, we can start using it to collect and review sentences.
I understand it’s frustrating, but please keep collecting sentences so we can submit them through the tool as soon as it’s ready.
Yes, we are working with the Deep Speech team to have a document with all the requirements, and we want the sentence collection tool to enforce some of them so we don’t have to manually check them.
Thanks a lot Rubén,
I have some other questions:
- How did the other languages manage? Is there any workaround?
- Why are this procedure and software now mandatory for the Spanish language?
Please understand that it is quite odd that the third (maybe second or fourth) most spoken language in the world is not included yet…
Best Regards,
Mar
Most languages that reached enough sentences to be in the voice phase (more than 5000) were the ones that provided that many strings during the campaign we ran a few months ago. Unfortunately, Spanish did not gather many (just a few hundred).
Other languages were using GitHub pull requests, but we have found that some of them will need a cleanup in order to be useful for the machine learning engine. That’s why, from now on, we want to make sure the sentences we include are properly reviewed, to avoid having to do another cleanup in the future.
The good news is that this should not stop any efforts to collect sentences. In early January, as soon as we have the tool ready, we will be able to mass import them and use the tool to do a proper review and approval.
Sorry for nitpicking, but CC0 is not the same as Public Domain, according to the Creative Commons wiki:
https://wiki.creativecommons.org/wiki/CC0_FAQ
I think it could be more accurate to say that the project accepts sentences from CC0 and Public domain sources.
Hi @jumasheff I consulted with our legal team to make sure I was giving you the best answer and they let me know that CC0 is a dedication to the public domain (to the maximum extent possible). In fact, the license is called “Public Domain Dedication.” Let me know if you have any specific questions about this or would like further clarification.
I want to note that we have just launched the sentence collection tool.
In Spanish we have collected and reviewed enough sentences and translated the website, but the language is still not enabled for voice recording. When will Spanish be enabled for voice recording?
That’s awesome to hear! Atm I still have to run a script & redeploy to make a language available. I’ll try to get that done this week.
Hello! May I translate this announcement into Interlingua? Thank you. You can reply in English, Italian, Spanish, French, or Portuguese. Thank you.
Since the bn-Wikipedia scraping is on hold because of the rust-punkt issue, I would like to know about the legality of using sentences from other public domain sources, mainly for the Bengali language, such as:
which may be a bit problematic legally, but the quality of the data is good in the sense that it is scraped from the web and hence contains actual colloquial language. This corpus also contains data for most languages.
The other is
https://cse.iitkgp.ac.in/~pabitra/shruti_corpus.html
This is another CC0 dataset, collected by a university to help evaluate automatic speech recognition systems, so it could be a good validation dataset.
OSCAR was actually already suggested in the past, but for legal reasons a decision was made not to use it. See “Using OSCAR corpus sentences”. I would have to invest a bit more time in checking the second source you provided to give any valid input on it, but at first glance it seems more like an already complete, unrelated voice dataset than a great source of sentences for Common Voice.