Allow copyrighted text with a takedown notice

My understanding is that Common Voice is currently in maintenance mode, but I would like to suggest a feature that would be especially beneficial for low-resource languages: allowing copyrighted text with a takedown notice.
I could work on this feature and submit a pull request.

I have 1.2 million sentences in Abkhazian, and possibly more in other languages, but I can’t use them because they are copyrighted. We could show a takedown notice so that contributors understand the risk; they would then have the choice to contribute either to this copyrighted data set or to the more reliable CC0 data set.


I understand what you are aiming for, but to me this unfortunately sounds an awful lot like suggesting that we intentionally break the law and hope no one complains.

Definitely, we don’t want to get anyone upset! If the owner doesn’t want their text to be used for research and non-commercial purposes, we should protect their privacy: the text would be taken down and the related voice recordings deleted.

Have you thought about contacting the authors of your texts in advance and securing their permission first? If they agree to release their texts in some form for this project, there will be no further problem including them, and if they disagree, that is still better than including the texts without their permission and then getting sued for copyright infringement.

You are right!
That’s a clean solution, and they would probably allow the use of the text under some terms.
The next question is: how can we include all this text in Common Voice?

Just make sure you can get this released as CC0; importing is not a big issue.

I did this for Esperanto, I asked blogs and web magazines and most of them were happy to donate sentences to the project.

I always make clear that I will only use sentences with fewer than 15(?) words and that the dataset will be released as CC0. So there is no recognizable text, only a list of sentences that they give away for free.

That depends on the size and quality of the individual sources. If there is one big source, or potentially several smaller ones of highly comparable quality, you could just extract everything into one file, one sentence per line, and submit it in a pull request to the Common Voice repository. You would then need preferably at least two or three people to do quality assurance on those sentences; if you ask how in the PR, someone will gladly guide you. If the sources you have in mind are mostly smaller individual works, e.g. articles from some blogs or, depending on the case, even individual books, I’m afraid you will just have to import them into the Sentence Collector and pass them through the normal process.
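
To make the "one file, one sentence per line" step concrete, here is a minimal Python sketch. It assumes the sentences have already been split out of the source texts; the function name and the output path are hypothetical, and the exact file naming and location expected by the repository should be checked in the PR.

```python
from pathlib import Path


def write_sentence_file(sentences, out_path):
    """Write unique, non-empty sentences to out_path, one per line (UTF-8).

    Hypothetical helper for illustration; not part of Common Voice.
    """
    seen = set()
    unique = []
    for sentence in sentences:
        # Collapse runs of whitespace (including stray newlines) so that
        # each sentence is guaranteed to occupy exactly one line.
        cleaned = " ".join(sentence.split())
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            unique.append(cleaned)
    Path(out_path).write_text(
        "".join(s + "\n" for s in unique), encoding="utf-8"
    )
    return unique
```

Calling `write_sentence_file(my_sentences, "sentences.txt")` would produce a file that reviewers can read through line by line. Duplicates are dropped in first-seen order, so the file keeps the order of the input.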

Did you get a handwritten release form for these sentences? I did that for the first 5000 sentences that I have submitted here.

So there is no recognizable text, only a list of sentences that they give away for free.

Are you sure of this?! Common Voice could barely collect 3 sentences per article from Wikipedia due to copyright limitations.

Well, I saw how many sentences disappeared after I filtered them by length, foreign letters, structure and so on. Often I could only use around a third of a text. But this wasn’t a legal argument, just an argument to take away the fears of some authors. Most authors care about their texts, not so much about an alphabetical list of sentences from their texts. I just had the feeling that this argument helps to get permission.
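
For anyone who wants to try this kind of filtering, here is a rough Python sketch of the length and foreign-letter filters followed by sorting. It is not the script used for Esperanto; the function name, the `allowed_letters` parameter and the default of 14 words (i.e. fewer than 15, itself only an approximate figure) are assumptions for illustration, and the "structure" checks are not covered.

```python
def filter_sentences(sentences, allowed_letters, max_words=14):
    """Keep short sentences made only of allowed letters, sorted.

    Hypothetical helper: allowed_letters is a string containing the
    alphabet of the target language; max_words is an assumed limit.
    """
    allowed = set(allowed_letters.lower())
    kept = set()
    for sentence in sentences:
        words = sentence.split()
        # Length filter: drop empty sentences and overly long ones.
        if not words or len(words) > max_words:
            continue
        # Foreign-letter filter: every alphabetic character must belong
        # to the allowed alphabet (digits and punctuation are ignored).
        if any(ch.isalpha() and ch.lower() not in allowed for ch in sentence):
            continue
        kept.add(" ".join(words))
    # sorted() orders by Unicode code point, which only approximates
    # alphabetical order, but it is enough to detach the sentences from
    # their original order in the text.
    return sorted(kept)
```

With an alphabet string for the language at hand, `filter_sentences(raw_sentences, alphabet)` returns a deduplicated, sorted list that can be shown to authors as an example of what would actually be published.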

Have you tried to get them to sign a release form? Here is a link to a form that I have used previously.

No, I just saved the emails in which they confirmed their consent. This might be a little risky, but it was enough for me.