Problems finding public domain sentences

In our country, 1) government press releases are all public domain, and 2) we also ask community members to donate their articles under a CC0 license. These are the main sources of our locale’s sentences.

Also, you can NOT use CC licensed materials. They are NOT CC0/Public Domain compatible. (CC is different from CC0).

Translating existing sentences from other locales may also be a good way to collect sentences. Just make sure the translations are fluent enough to sound like real sentences people would say in daily life.

The sentence collection tool’s how-to page now has some guidance on where to find public domain material:

https://common-voice.github.io/sentence-collector/#/how-to

About those hints,

1) Does anyone know of any documentation on how to get only CC0 material from Common Crawl?
I was looking at the site, but there is a lot of content, and I would like some help getting started faster.

2) I was wondering whether the content (the subtitles) from ‘opensubtitles’ might be a ‘derivative work’ of the movies themselves.
If this is the case, we may have problems using it, right?

Here’s a snippet on how we got CC-0 text for Kyrgyz. Full article

TL;DR - We asked a news publication to donate their text.

‘’’
Currently, all Kyrgyz text sentences used for this project come from the well-known Kyrgyz language news source Kloop.kg. The founder of Kloop.kg, Bektour Iskender - a proponent of an open internet and the Creative Commons - allowed use of Kyrgyz language articles from Kloop to be distributed under CC-0. As such, when the user reads a sentence for Kyrgyz Common Voice, they are actually reading news from Kloop.kg. This is a major win for the Kyrgyz language and the open internet, because finding CC-0 text for Common Voice is typically the most difficult task in adding a new language. At least 5,000 different sentences should be initially recorded, and most books and online news (such as BBC Kyrgyz) are not available under CC-0.

After the text was automatically downloaded from Kloop (via this Python script), the text was cleaned (all foreign words, numbers, abbreviations were removed) and sentences of an appropriate length were selected. Ideally each recording should be about 5 seconds long. More text can be added later, such that there is more diversity in the kinds of sentences read. Diversity is important for Common Voice, because good speech technologies should recognize the speech of people speaking with different accents about different topics.
‘’’
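
For anyone who wants to try something similar, here is a minimal sketch of the cleaning and selection step described in the snippet above. It is not the script used for Kloop. It drops whole sentences that contain digits, abbreviations, or Latin-script words rather than removing the words themselves. The word-count limits and the regular expressions are my own assumptions (the article only says "appropriate length" and "about 5 seconds"), so adjust them for your language.

```python
import re

# Hypothetical limits: the article only asks for sentences of an
# "appropriate length" that take about 5 seconds to read aloud.
MIN_WORDS = 3
MAX_WORDS = 14

DIGIT = re.compile(r"\d")
# Two or more consecutive Cyrillic capitals: a rough test for abbreviations.
ABBREVIATION = re.compile(r"[А-ЯЁҢӨҮ]{2,}")
# Any Latin letter: a rough stand-in for "foreign word" in Cyrillic-script text.
LATIN = re.compile(r"[A-Za-z]")


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def keep(sentence):
    """Return True if the sentence passes all of the rough filters."""
    if DIGIT.search(sentence) or ABBREVIATION.search(sentence) or LATIN.search(sentence):
        return False
    return MIN_WORDS <= len(sentence.split()) <= MAX_WORDS


def select_sentences(text):
    """Split raw article text and keep only the sentences that pass keep()."""
    return [s for s in split_sentences(text) if keep(s)]
```

Calling `select_sentences(article_text)` on the raw text of one article returns the candidate sentences. A native speaker should still review the output, because filters this crude will miss many foreign words and abbreviations.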

One approach I found useful relies on the fact that, because of GDPR, corporations like Facebook and Google must now let you obtain a copy of your own data. This lets you download all your chat logs and similar data, from which you can easily extract your messages, review them, and then share them with Common Voice as CC0. A good tip, however, is to shuffle the messages and mix them with another CC0 source to keep a little bit of anonymity if you’re worried.
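
The shuffling and mixing tip could look roughly like the sketch below. It assumes you have already extracted and reviewed your own messages and have a second list of CC0 sentences; the function name and the `seed` parameter are only illustrative.

```python
import random


def mix_sources(own_messages, other_sentences, seed=None):
    """Merge two lists of sentences, drop blanks and duplicates, and shuffle."""
    cleaned = (s.strip() for s in list(own_messages) + list(other_sentences))
    # dict.fromkeys removes duplicates while keeping the first occurrence.
    combined = list(dict.fromkeys(s for s in cleaned if s))
    # A seed makes the shuffle repeatable; leave it as None for a random order.
    random.Random(seed).shuffle(combined)
    return combined
```

Shuffling only hides the order and origin of the messages. It does not remove names or other personal details, so the manual review step is still needed.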

Look to your own past work when you need a large amount of text. It’s fairly easy, since we’ve all had to write dissertations or used instant messaging services. It just needs a bit of review, and it makes perfect CC0 content because of its diversity.
@nukeador

Hi all, I understand the requirement for CC0 so that no licensing restrictions may limit future outcomes of the DeepSpeech models. I do wonder, though, if this isn’t unnecessarily holding back more rapid progress of this great project.

FB’s LASER project has just reached a major milestone for rapidly translating phrases from 93 languages (see this detailed post and this full paper). All code is released on https://github.com/facebookresearch/LASER under CC-BY-NC. This leverages the huge Tatoeba corpus, some of which is CC0 but most of it is under CC-BY. FB’s advances wouldn’t have been possible without this great resource.

Given that CC-BY places no restrictions other than needing to attribute that some source material may have come from tatoeba.org, what are the arguments against allowing this kind of source/license? It would certainly allow jump-starting some languages where collected sentences are still at 0 (e.g., Arabic). For example, Tatoeba has 31,481 sentences available in Arabic, but none of them are CC0, all are CC-BY. I ran a quick summary on the sentences.csv that can be downloaded; here are the number of sentences by language.
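
In case anyone wants to reproduce the per-language summary, this is roughly how it can be done. The sketch assumes that `sentences.csv` is tab-separated with three columns (sentence id, language code, text); please check that against the file you download.

```python
import csv
from collections import Counter


def count_by_language(rows):
    """Count rows per language code, which is assumed to be the second column."""
    counts = Counter()
    for row in rows:
        if len(row) >= 3:  # skip malformed or empty lines
            counts[row[1]] += 1
    return counts


def count_file(path):
    """Read a tab-separated sentences file and count sentences per language."""
    with open(path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE keeps quotation marks inside sentences as plain text.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return count_by_language(reader)
```

For example, `count_file("sentences.csv").most_common(10)` lists the ten languages with the most sentences.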

I’d love to understand better what the hard arguments are for excluding CC-BY given such great resources.

cc @nukeador @mhenretty

We are going to have a Common Voice meeting next week where we want to outline some strategy for 2019.

My understanding is that, at the time, the CC-0 license was selected because it gave us more freedom to engage with more people using the datasets in the future, but that is definitely something we should re-evaluate given the current situation, our experience, and our goals for the future.

I’ll make sure the license topic is discussed with the team and outcomes of the conversation are presented to the community for feedback.

Cheers.

@nukeador Have you decided on the meeting time?
I’m really interested in this, especially for Arabic.

As @tinok said, Arabic has no CC0 sentences but many CC-BY ones.

My question is: we have many, many old Arabic proverbs (with no known source). Can I use them? I will use Modern Standard Arabic (MSA).

The meeting is going to be an in-person, three-day meeting in the Berlin office, just to coordinate the staff teams.

Oh sorry, I thought it would be online :)
Thank you. We will wait for the outcome and hope a solution can be found for languages that have no CC0 text.
Thanks

@nukeador That’s great to hear, thank you for the update.

I would say anything you write is CC0 (if you choose to license it as such!). So anything you write from memory, including proverbs, can be added to the sentence collector.

Similarly, although the simple sentences in Tatoeba may be licensed as CC-BY, no one owns the phrase “ذهبنا إلى لندن العام الماضي.,ذهبنا الى لندن العام الماضي” (“Last year we traveled to London”), regardless of which book or website it comes from. I think the copyright applies to the entire database, not individual sentences.

It would be nice if there was some legal guidance on how to use large CC-BY resources for various languages without each contributor needing to make a moral judgment as to whether they are somehow pirating something.

Hi, any updates? Can we use the Tatoeba corpus or not? They have a lot of conversational sentences.
cc @nukeador

Unfortunately we can’t; their license is not compatible with Public Domain CC-0. We are working on other alternatives to help gather public domain sentences, and we have ideas that we will publish in the coming days along with a summary of the Berlin meetup.

Thank you for the answer. Yeah, it will be interesting to see the summary.

Maybe a useful approach for others: We started using translations from the English Common Voice corpus as a source for adding sentences in Arabic.

This requires native speakers to confirm the accuracy of the translations (because we wouldn’t want blindly translated phrases to go to the sentence collector). But it’s a starting point that may work for other languages as well that struggle finding enough CC0 phrases.

@nukeador Is there any new guidance on using sentences from Tatoeba? The sentences we are considering for Arabic are licensed under CC-BY. But as I wrote above, I think the license applies to the entire database, not each individual sentence (all of which have been used in many places previously, including other copyrighted material).

We have analyzed the Tatoeba database and found 31,806 Arabic sentences. They are all good quality. Would randomly selecting 5,000 or any other number for inclusion in the sentence collector be a violation of the CC license?

It would be great to have definitive guidance since I’m sure many other people are finding and collecting sentences from other sources but are unsure about the legal questions (or simply go ahead regardless).

Maybe reaching out to Tatoeba would be possible so that including random subsets (rather than the entire database) would get an explicit exemption?

Later this week I’ll be posting an update on sentence collection from the team that I hope will help in this matter.

I uploaded around 14,000 Modern Standard Arabic sentences to the sentence collector that need verification.

Any update on the usage of CC-BY?

For now we should stick with public domain. The update I posted was about the “fair use” of some large sources of text and the work we started doing with Wikipedia.