If we can’t find any Creative Commons licensed material, can we translate from the English dataset? I mean, we could also use this dataset for machine translation later on. There are many datasets, but none of them seem to fit the requirements; most of them are under the MIT license.
Sentence collection tool development topic
For which language are you having problems finding public domain sentences?
From my observations, the most efficient way to contribute to the corpus is to find a large public domain corpus somewhere.
But I know that for some languages this is not possible, so writing your own sentences is preferred to translation, in order to get more natural sentences and cover more sounds. Also, I personally think it takes less time to create new sentences than to translate the English ones.
I’m having some problems trying to find sentences for Brazilian Portuguese. There is a corpus that was used by a university group in work they did using Julius. The group is called Fala Brasil, and they have been doing speech research for a few years. I haven’t had the opportunity to engage directly with them. As I said, they have lots of sources; I’ll link them here: http://labvis.ufpa.br/falabrasil/downloads/
I couldn’t find details about the license.
Update: I’ve found the license in this README.
About the sentence collection, I think I’ll write some of them myself, but from a legal point of view, how do I prove that I wrote them? That’s my concern.
Also, could you link the sentences used in Common Voice in English? I’ll translate some of them. One of the problems when creating new sentences, at least for me, is that I run out of ideas, so translating would be easier for me.
When submitting sentences, you declare either that you created them and release them into the public domain, or that you found them in a specific source that is also public domain.
I’ve just updated the sentence collection how-to page to include some hints on where to find public domain material. Additional ideas are welcome.
(It should be live in the next deployment)
In our country, 1) government press releases are all public domain, and 2) we are also asking community members to donate their articles under CC0. These are the main sources our locale’s sentences came from.
Also, you can NOT use CC licensed materials. They are NOT CC0/Public Domain compatible. (CC is different from CC0).
Translating existing sentences from other locales may also be a good way to collect sentences. Just make sure the translations are fluent enough to sound like real sentences people would say in daily life.
The sentence collection tool how to now has some guidance on where to find public domain material:
About those hints,
1) Does anyone know of any documentation on how to get only CC0 material from Common Crawl?
I was looking at the site, but there is a lot of content, and I would like some help getting started faster.
2) I was wondering whether the content (the subtitles) from ‘opensubtitles’ might be a ‘derivative work’ of the movies themselves.
If that is the case, we may have problems using it, right?
Here’s a snippet on how we got CC-0 text for Kyrgyz. Full article
TL;DR - We asked a news publication to donate their text.
Currently, all Kyrgyz text sentences used for this project come from the well-known Kyrgyz language news source Kloop.kg. The founder of Kloop.kg, Bektour Iskender - a proponent of an open internet and the Creative Commons - allowed use of Kyrgyz language articles from Kloop to be distributed under CC-0. As such, when the user reads a sentence for Kyrgyz Common Voice, they are actually reading news from Kloop.kg. This is a major win for the Kyrgyz language and the open internet, because finding CC-0 text for Common Voice is typically the most difficult task in adding a new language. At least 5,000 different sentences should be initially recorded, and most books and online news (such as BBC Kyrgyz) are not available under CC-0.
After the text was automatically downloaded from Kloop (via this Python script), the text was cleaned (all foreign words, numbers, abbreviations were removed) and sentences of an appropriate length were selected. Ideally each recording should be about 5 seconds long. More text can be added later, such that there is more diversity in the kinds of sentences read. Diversity is important for Common Voice, because good speech technologies should recognize the speech of people speaking with different accents about different topics.
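The cleaning step described above could look roughly like the sketch below. This is not the linked script: the allowed character set, the "two or more capitals means abbreviation" rule, the word-count limits used as a stand-in for "about 5 seconds", and the name `keep_sentence` are all assumptions made for illustration. It also discards whole sentences rather than deleting words from them, which may differ from what was actually done.

```python
import re

# Assumption: a "clean" sentence contains only Cyrillic letters (including the
# Kyrgyz letters ң, ө, ү), whitespace and basic punctuation. Digits and Latin
# letters (used here as a rough proxy for foreign words) cause a rejection.
ALLOWED = re.compile(r"^[А-Яа-яЁёҢңӨөҮү\s,.!?\-]+$")

# Assumption: a run of two or more capital letters is an abbreviation.
ABBREVIATION = re.compile(r"\b[А-ЯЁҢӨҮ]{2,}\b")


def keep_sentence(sentence, min_words=3, max_words=14):
    """Return True if the sentence looks suitable for recording."""
    text = sentence.strip()
    if not ALLOWED.match(text):
        return False  # contains digits, Latin letters or other symbols
    if ABBREVIATION.search(text):
        return False  # contains something that looks like an abbreviation
    # Word count is only a crude proxy for a recording of about 5 seconds.
    return min_words <= len(text.split()) <= max_words
```

A list of candidate sentences can then be filtered with `[s for s in candidates if keep_sentence(s)]`; the limits would need tuning per language.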
A good way I found is to use the fact that corporations like Facebook, Google, etc. must now let you obtain a copy of your own data because of GDPR. This lets you download all your chat logs and the like, from which you can easily extract all your messages, review them, and then share them with Common Voice as CC0. A good tip, however, is to shuffle the messages and mix them with another CC0 source to keep a little anonymity if you’re worried.
Look to your own past work when it comes to finding large amounts of text. It’s fairly easy, since we’ve all had to write dissertations or used instant messaging services. It just needs a bit of review, and it’s perfect CC0 content because of its diversity.
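For anyone who wants to try this, here is a rough sketch of the extract, shuffle and mix steps. Export formats differ between services, so the input shape (a list of dicts with `sender` and `text` keys) and the function name `build_submission` are hypothetical; you would first need to convert your own export into that shape. The manual review still has to happen afterwards.

```python
import random


def build_submission(messages, my_name, other_sentences, seed=None):
    """Keep only messages written by `my_name`, mix in other CC0 sentences,
    and shuffle so the original conversation order is lost."""
    mine = [
        m["text"].strip()
        for m in messages
        # Only your own messages: you cannot release other people's text.
        if m.get("sender") == my_name and (m.get("text") or "").strip()
    ]
    # Drop exact duplicates while keeping first occurrences.
    pool = list(dict.fromkeys(mine + list(other_sentences)))
    random.Random(seed).shuffle(pool)
    return pool
```

Shuffling only hides the order of the messages, not their content, so anything personal still has to be removed by hand before submitting.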
Hi all, I understand the requirement for CC0 so that no licensing restrictions may limit future outcomes of the DeepSpeech models. I do wonder, though, whether this isn’t unnecessarily holding back more rapid progress of this great project.
FB’s LASER project has just reached a major milestone for rapidly translating phrases from 93 languages (see this detailed post and this full paper). All code is released on https://github.com/facebookresearch/LASER under CC-BY-NC. This leverages the huge Tatoeba corpus, some of which is CC0 but most of it is under CC-BY. FB’s advances wouldn’t have been possible without this great resource.
CC-BY places no restrictions other than needing to attribute that some source material may have come from tatoeba.org. What are the arguments against allowing this kind of source/license? It would certainly allow jump-starting some languages where collected sentences are still at 0 (e.g., Arabic). For example, Tatoeba has 31,481 sentences available in Arabic, but none of them are CC0; all are CC-BY. I ran a quick summary on the sentences.csv that can be downloaded; here are the number of sentences by language.
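A per-language count like that can be reproduced with a few lines of Python. This is a sketch, not the exact summary I ran: it assumes the file is tab-separated with no header row and three columns (sentence id, language code, text), and the function name `count_by_language` is made up for the example.

```python
import csv
from collections import Counter


def count_by_language(lines):
    """Count sentences per language code.

    `lines` is any iterable of text lines, e.g. an open file.
    Assumed layout per line: id <TAB> language code <TAB> sentence text.
    """
    counts = Counter()
    # QUOTE_NONE: quote characters inside sentences are ordinary text.
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) >= 3:  # skip blank or malformed lines
            counts[row[1]] += 1
    return counts


# Usage (assuming the download was saved as sentences.csv):
# with open("sentences.csv", encoding="utf-8", newline="") as f:
#     for lang, n in count_by_language(f).most_common():
#         print(lang, n)
```

Note that this only counts sentences per language; it says nothing about which license each sentence carries.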
I’d love to understand better what the hard arguments are for excluding CC-BY given such great resources.
We are going to have a Common Voice meeting next week where we want to outline some strategy for 2019.
My understanding is that, at the time, the CC-0 license was selected because it gave us more freedom in the future to engage with more people using the datasets, but that is definitely something we should re-evaluate given the current situation, our experience, and our goals for the future.
I’ll make sure the license topic is discussed with the team and outcomes of the conversation are presented to the community for feedback.
@nukeador Have you decided on the meeting time?
I’m really interested in this, especially for Arabic.
As @tinok said, Arabic has no CC0 sentences but many CC-BY ones.
My question is: we have many, many old Arabic proverbs (of unknown source). Can I use them? I will use Modern Standard Arabic (MSA).
The meeting is going to be an in-person, 3-day meeting in the Berlin office, just to coordinate the staff teams.
Oh sorry, I thought it would be online.
Thank you. We will wait for the outcome and hope a solution can be found for the lack of CC0 material.
@nukeador That’s great to hear, thank you for the update.
I would say anything you write is CC0 (if you choose to license it as such!). So anything you write from memory, including proverbs, can be added to the sentence collector.
Similarly, although the simple sentences in Tatoeba may be licensed as CC-BY, no one owns the phrase “ذهبنا إلى لندن العام الماضي.,ذهبنا الى لندن العام الماضي” (“Last year we traveled to London”), regardless of which book or website it comes from. I think the copyright applies to the entire database, not individual sentences.
It would be nice if there was some legal guidance on how to use large CC-BY resources for various languages without each contributor needing to make a moral judgment as to whether they are somehow pirating something.
Hi, any updates? Can we use the Tatoeba corpus or not? They have a lot of conversational sentences.
Unfortunately we can’t; their license is not compatible with Public Domain CC-0. We are working on other alternatives to help gather public domain sentences, and we have ideas that we will publish in the coming days along with a summary of the Berlin meetup.