I work with a research group at Reykjavík University and we are interested in contributing to Common Voice. There is a fairly large amount of data for the Icelandic language connected to our projects that could potentially be added to a collection such as this, either directly or by adapting it in some way, but there could be licensing issues. I suppose it boils down to this:
Can CC-BY content, or other permissively licenced material that is not completely in the public domain, be used in any way by the Common Voice project?
Is there a way to extract data from large collections so that the pieces aren’t copyrightable? Are e.g. many small text fragments from a book collection, news articles, etc., copyrightable if the original source cannot be reconstructed from the collection? Can they form the basis for a public-domain work, or would they merely fall under fair use, with the original author of individual fragments still holding an interest? Do some of you have experience in this area?
For Q1, I believe they’re only able to use CC-0 / fully PD works, but it would be best to get an official answer.
One possible option might be to find a way to connect easily to the CC-BY data. It could still be hosted externally (e.g. on some resource where you already have it), but in a form that’s compatible and consistent with the other data, and a simple import script could then be included in the project.
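To make that a bit more concrete, here is a rough sketch of what such an import script could look like. Everything specific in it is my own assumption: the tab-separated layout, the `sentence` and `source` column names, and the `import_sentences` function are hypothetical and are not part of Common Voice.

```python
import csv
import io


def import_sentences(tsv_text, license_name):
    """Turn externally hosted TSV text into uniform records that keep attribution.

    Assumes (hypothetically) a header row with 'sentence' and 'source' columns.
    """
    # QUOTE_NONE so that quotation marks inside sentences are kept literally.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    records = []
    for row in reader:
        sentence = (row.get("sentence") or "").strip()
        if not sentence:
            continue  # skip rows without any sentence text
        records.append(
            {
                "sentence": sentence,
                "source": (row.get("source") or "").strip(),
                "license": license_name,
            }
        )
    return records
```

The point of carrying `source` and `license` along with every record is that the attribution stays attached to the data wherever it ends up.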
For Q2, I’m not a lawyer, but presumably if one identifies common sentences that co-occur across more than one book, it’s fair to use them, as no single author could claim you’d copied the sentence from their book. This isn’t necessarily 100% watertight, since maybe the first author simply hasn’t got round to suing the other author yet, but when a sentence has no obscure words in it and is widely used, I suspect it would be fairly safe to rely on. One practical matter is that you’d need a copy of the copyrighted works to do the comparisons automatically.
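As a minimal sketch of the comparison I have in mind (not legal advice, and not anything Common Voice itself provides), something like the following would keep only sentences that appear in at least two different sources. The function names, the naive sentence splitter, and the `min_sources` threshold are all my own assumptions; real Icelandic text would want a proper tokenizer.

```python
import re
from collections import defaultdict


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def normalize(sentence):
    # Collapse whitespace and lowercase so trivial differences don't matter.
    return re.sub(r"\s+", " ", sentence).strip().lower()


def shared_sentences(books, min_sources=2):
    """Return normalized sentences found in at least `min_sources` books.

    `books` maps a title to the full text of that work.
    """
    sources = defaultdict(set)
    for title, text in books.items():
        for sentence in split_sentences(text):
            sources[normalize(sentence)].add(title)
    return sorted(s for s, titles in sources.items() if len(titles) >= min_sources)
```

Raising `min_sources` makes the filter more conservative, at the cost of yielding fewer sentences.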
nukeador
(Rubén Martín [❌ taking a break from Mozilla])
To clarify: yes, we can currently only use CC-0 public domain material.
Just following up to ask whether this has changed at all. I have been looking at the same database as @krun, and the data is mostly or entirely licensed under CC-BY 4.0 or CC-BY 3.0, which require only attribution to the author. When I spoke to a manager of the dataset, he suggested that it would satisfy the license if a simple readme file containing the right attribution notices accompanied each download of the dataset from Common Voice.
@nukeador Is that a possibility now or in the near future? It would largely solve the sentence collection problem for Icelandic for the foreseeable future.
We will also shortly upload a few thousand CC-0 sentences to get Icelandic going!
nukeador
(Rubén Martín [❌ taking a break from Mozilla])
Hi,
This has not changed; we need sentences to be public domain (CC-0).
But please check this topic for reference on the alternative strategies we are pursuing.