Licensing and contribution to Common Voice


I work with a research group at Reykjavík University and we are interested in contributing to Common Voice. There is a fairly large amount of data for the Icelandic language connected to our projects that could potentially be added to a collection such as this, either directly or by adapting it in some way, but there could be licensing issues. I suppose it boils down to this:

  1. Can CC-BY content, or other permissively licensed material that is not completely in the public domain, be used in any way by the Common Voice project?
  2. Is there a way to extract data from large collections so that the pieces aren’t copyrightable? For example, are many small text fragments taken from a book collection, news articles, etc. still copyrightable if the original source cannot be reconstructed from the collection? Can they form the basis for a public-domain work, or would they merely fall under fair use, with the original authors of the individual fragments still holding an interest? Does anyone here have experience in this area?