Licensing and contribution to Common Voice

krun · September 13, 2018, 3:10pm

I work with a research group at Reykjavík University and we are interested in contributing to Common Voice. There is a fairly large amount of data for the Icelandic language connected to our projects that could potentially be added to a collection such as this, either directly or by adapting it in some way, but there could be licensing issues. I suppose it boils down to this:

1. Can CC-BY content, or other permissively licenced material that is not completely in the public domain, be used in any way by the Common Voice project?
2. Is there a way to extract data from large collections so that the pieces aren’t copyrightable? Are e.g. many small text fragments from a book collection, news articles, etc., copyrightable if the original source cannot be reconstructed from the collection? Can they form the basis for a public-domain work, or would they merely fall under fair use, with the original author of individual fragments still holding an interest? Do some of you have experience in this area?

eugen · December 4, 2018, 3:21pm

I am also interested in the answer to this question.
We found some CC-BY sources for the Romanian language that are perfect, and we would like to use them.

nmstoker · December 12, 2018, 11:04pm

For Q1, I believe they’re only able to use CC-0 / fully PD works, but it would be best to get an official answer.

One possible option might be to find a way to easily connect to the CC-BY data. It could still be hosted externally (e.g. on some resource where you already have it), but in a form that’s compatible/consistent with the other data, with a simple import script included in the project.

For Q2, I’m not a lawyer, but presumably if one identifies common sentences that occur in more than one book, it’s fair to use them, as no single author could claim you’d copied the sentence from their book. This isn’t necessarily 100% watertight, since maybe the first author simply hasn’t got round to suing the other author yet, but when a sentence has no obscure words in it and is widely used, I suspect it would be fairly safe to rely on. There is also a practical matter: you’d need a copy of the copyrighted works to do the comparisons automatically.
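
To make the comparison idea above concrete, here is a minimal Python sketch of how one might list sentences that appear in more than one source text. Everything in it is an assumption for illustration: the function names, the naive punctuation-based sentence splitter, and the lowercase/whitespace normalization are hypothetical and not part of any Common Voice tooling. It only finds overlaps; it says nothing about whether a sentence is actually free to use.

```python
import re
from collections import defaultdict


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def normalize(sentence):
    """Lowercase and collapse whitespace so trivial differences don't block a match."""
    return " ".join(sentence.lower().split())


def shared_sentences(books, min_sources=2):
    """Return sentences found in at least `min_sources` different texts.

    `books` maps a (hypothetical) source name to its full text.
    """
    sources = defaultdict(set)   # normalized sentence -> names of texts containing it
    original = {}                # normalized sentence -> first spelling seen
    for name, text in books.items():
        for sentence in split_sentences(text):
            key = normalize(sentence)
            sources[key].add(name)
            original.setdefault(key, sentence)
    return sorted(
        original[key] for key, names in sources.items() if len(names) >= min_sources
    )
```

Calling `shared_sentences({"book_a": text_a, "book_b": text_b})` would return the sentences the two texts have in common. A real pipeline would probably need a language-aware sentence splitter and a further filter for rare or distinctive words.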

nukeador · January 9, 2019, 11:30am

To clarify: Yes, currently we can only use CC-0 Public Domain material.

david.e.mollberg · June 12, 2019, 10:23am

Hi!

Just following up to see whether this has changed at all. I have been looking at the same database as @krun, and the data is mostly or entirely licensed under CC-BY 4.0 or CC-BY 3.0, which require only attribution to the author. After I spoke to a manager of the dataset, he suggested that a simple readme file with the right attribution remarks, shipped with the dataset download on Common Voice, would satisfy the license.

@nukeador Is that a possibility now or in the near future? It would largely solve the sentence collection problem for Icelandic for the foreseeable future.

We will also shortly upload a few thousand CC-0 sentences to get Icelandic going!

nukeador · June 12, 2019, 12:40pm

Hi,

This has not changed: we need sentences to be public domain (CC-0).

But please check this topic for reference on the alternative strategies we are pursuing
