If we can’t find any Creative Commons licensed material, can we translate sentences from the English dataset? We could also use the resulting data for machine translation later on. There are many datasets, but none of them seems to fit the requirements; most of them are under the MIT license.
Sentence collection tool development topic
For which language are you having problems finding public domain sentences?
From my observations, the most efficient way to contribute to the corpus is to find a large public domain corpus somewhere.
But I know that for some languages this is not possible. In that case, writing your own sentences is preferable to translating, because it gives more natural sentences and covers more sounds. Also, I personally think it takes less time to create new sentences than to translate the English ones.
I’m having some problems finding sentences for Brazilian Portuguese. There is a corpus that was used by a university group in work they did with Julius. The group is called Fala Brasil, and they have been doing speech research for a few years. I haven’t had the opportunity to engage with them directly. As I said, they have lots of resources, which I’ll link here: http://labvis.ufpa.br/falabrasil/downloads/
I couldn’t find any details about the license.
Update: I’ve found the license in this README.
About the sentence collection, I think I’ll write some of them myself, but from a legal point of view, how can I prove that I wrote them? That’s my concern.
Also, could you link the sentences used in Common Voice in English? I’ll translate some of them. One of the problems with creating new sentences, at least for me, is that I run out of ideas, so translating would be easier for me.
When submitting sentences, you declare either that you created them and release them into the public domain, or that you found them in a specific source that is also in the public domain.
I’ve just updated the sentence collection how-to page to include some hints on where to find public domain material. Additional ideas are welcome.
(It should be live in the next deployment)
In our country, 1) government press releases are all public domain, and 2) we are also asking community members to donate their articles under a CC0 license. These are the main sources of our locale’s sentences.
Also, you can NOT use CC licensed materials. They are NOT CC0/Public Domain compatible. (CC is different from CC0).
Translating existing sentences from other locales may also be a good way to collect sentences. Just make sure the translations are fluent enough to sound like real sentences people would say in daily life.
The sentence collection tool how to now has some guidance on where to find public domain material:
About those hints,
1) Does anyone know of any documentation on how to get only CC0 material from Common Crawl?
I was looking at the site, but there is a lot of content, and I would like some help getting started faster.
2) I was wondering whether the content (the subtitles) from ‘opensubtitles’ might be a ‘derivative work’ of the movies themselves.
If that is the case, we may have problems using it, right?
Here’s a snippet on how we got CC-0 text for Kyrgyz. Full article
TL;DR - We asked a news publication to donate their text.
Currently, all Kyrgyz text sentences used for this project come from the well-known Kyrgyz language news source Kloop.kg. The founder of Kloop.kg, Bektour Iskender - a proponent of an open internet and the Creative Commons - allowed use of Kyrgyz language articles from Kloop to be distributed under CC-0. As such, when the user reads a sentence for Kyrgyz Common Voice, they are actually reading news from Kloop.kg. This is a major win for the Kyrgyz language and the open internet, because finding CC-0 text for Common Voice is typically the most difficult task in adding a new language. At least 5,000 different sentences should be initially recorded, and most books and online news (such as BBC Kyrgyz) are not available under CC-0.
After the text was automatically downloaded from Kloop (via this Python script), the text was cleaned (all foreign words, numbers, abbreviations were removed) and sentences of an appropriate length were selected. Ideally each recording should be about 5 seconds long. More text can be added later, such that there is more diversity in the kinds of sentences read. Diversity is important for Common Voice, because good speech technologies should recognize the speech of people speaking with different accents about different topics.
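As a rough illustration of the cleaning and selection step described above, here is a minimal Python sketch. It is not the script used for Kloop; everything in it is an assumption made for illustration: the function names, the word-count limits used as a stand-in for "about 5 seconds", treating any Latin-script letter as a sign of a foreign word, treating all-caps tokens as abbreviations, and dropping a whole sentence rather than removing the offending word.

```python
import re

# Hypothetical thresholds: a crude word-count proxy for a ~5 second recording.
MIN_WORDS = 3
MAX_WORDS = 14

DIGIT = re.compile(r"\d")
# Assumption: in a Cyrillic-script language, Latin letters signal a foreign word.
LATIN = re.compile(r"[A-Za-z]")


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def is_usable(sentence):
    """Return True if the sentence passes all of the illustrative filters."""
    if DIGIT.search(sentence) or LATIN.search(sentence):
        return False
    words = sentence.split()
    if not MIN_WORDS <= len(words) <= MAX_WORDS:
        return False
    for word in words:
        token = word.strip(".,!?;:\"'()«»")
        # Assumption: an all-caps token of two or more letters is an abbreviation.
        if len(token) >= 2 and token.isupper():
            return False
    return True


def select_sentences(text):
    """Keep only the sentences that pass is_usable, in their original order."""
    return [s for s in split_sentences(text) if is_usable(s)]
```

Calling `select_sentences(article_text)` on a downloaded article returns the candidate sentences. The results would still need a manual read-through before submission, since simple rules like these miss many cases.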
A good way I found is to use the fact that, because of GDPR, corporations like Facebook, Google, etc. must now let you obtain a copy of your own data. This lets you download all your chat logs and similar data, from which you can easily extract all your messages, review them, and then share them with Common Voice as CC0. A good tip, however, is to shuffle the messages and mix them with another CC0 source to keep a little bit of anonymity if you’re worried.
Look to your own past work when you need a large amount of text. It’s fairly easy, since we’ve all had to write dissertations or have used instant messaging services. It just needs a bit of review, and it’s perfect CC0 content because of its diversity.
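The shuffle-and-mix tip could look something like the following Python sketch. The function name and parameters are hypothetical, and it assumes you have already extracted and reviewed your own messages and have a second list of CC0 sentences at hand.

```python
import random


def mix_and_shuffle(own_messages, other_cc0_sentences, seed=None):
    """Combine two lists of sentences and return them in random order.

    Passing a seed makes the order reproducible; leave it as None otherwise.
    """
    combined = list(own_messages) + list(other_cc0_sentences)
    random.Random(seed).shuffle(combined)
    return combined
```

Shuffling only hides the order and the source of each sentence. It does not remove names, addresses, or other personal details from the text, so the review step still matters.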