Licensing and contribution to Common Voice

I work with a research group at Reykjavík University and we are interested in contributing to Common Voice. There is a fairly large amount of data for the Icelandic language connected to our projects that could potentially be added to a collection such as this, either directly or by adapting it in some way, but there could be licensing issues. I suppose it boils down to this:

  1. Can CC-BY content, or other permissively licenced material that is not completely in the public domain, be used in any way by the Common Voice project?
  2. Is there a way to extract data from large collections so that the pieces aren’t copyrightable? Are e.g. many small text fragments from a book collection, news articles, etc., copyrightable if the original source cannot be reconstructed from the collection? Can they form the basis for a public-domain work, or would they merely fall under fair use, with the original author of individual fragments still holding an interest? Do some of you have experience in this area?

I am also interested in the answer to this question.
We found some CC-BY sources for the Romanian language that are perfect, and we would like to use them.

For Q1, I believe they’re only able to use CC-0 / fully PD works, but it would be best to get an official answer.

One possible option might be to find a way to connect easily to the CC-BY data: it could still be hosted externally (e.g. on some resource where you already have it), but in a form that’s compatible and consistent with the other data, with a simple import script included in the project.

For Q2, I’m not a lawyer, but presumably if one identifies common sentences that co-occur across more than one book, it’s fair to use them, as no single author could claim you’d copied the sentence from their book. This isn’t necessarily 100% watertight, since maybe the first author simply hasn’t got round to suing the other author yet, but when a sentence has no obscure words in it and is widely used, I suspect it would be fairly safe to rely on. There is, unfortunately, a practical catch: you’d need a copy of the copyrighted works to do the comparisons automatically.
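To make the comparison idea concrete, here is a rough sketch of how it might be automated, assuming you already have the texts available as plain strings. The function names, the naive sentence splitter and the `min_sources` threshold are all my own assumptions for illustration, not anything Common Voice provides, and whether a shared sentence is actually safe to use remains a legal question the code cannot answer.

```python
import re
from collections import defaultdict


def split_sentences(text):
    # Naive splitter (an assumption): break after ., ! or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def normalize(sentence):
    # Lowercase and collapse whitespace so trivial differences do not hide a match.
    return " ".join(sentence.lower().split())


def shared_sentences(sources, min_sources=2):
    # sources maps a source name (e.g. a book title) to its full text.
    # Returns the normalized sentences found in at least min_sources distinct sources.
    seen_in = defaultdict(set)
    for name, text in sources.items():
        for sentence in split_sentences(text):
            seen_in[normalize(sentence)].add(name)
    return sorted(s for s, names in seen_in.items() if len(names) >= min_sources)


if __name__ == "__main__":
    books = {
        "book_a": "It is raining. The dragon ate Bob.",
        "book_b": "Hello there!  It is  raining. Bye.",
    }
    print(shared_sentences(books))
```

A real pipeline would want a proper sentence tokenizer for the language in question, and probably a higher `min_sources` value, so that only genuinely common sentences survive.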

To clarify: yes, we can currently only use CC-0 (public domain) material.

Hi!

Just following up to ask whether this has changed at all. I have been looking at the same database as @krun, and the data is mostly or entirely licensed under CC BY 4.0 or CC BY 3.0, which require only attribution to the author. When I spoke to a manager of the dataset, he suggested that the license would be satisfied if a simple readme file with the right attribution accompanied the dataset download on Common Voice.

@nukeador Is that a possibility now or in the near future? It would largely solve sentence collection for Icelandic for the foreseeable future.

We will also shortly upload a few thousand CC-0 sentences to get Icelandic going!

Hi,

This has not changed; we still need sentences to be public domain (CC-0).

But please check this topic for reference on the alternative strategies we are pursuing.