How to deal with academic and public domain licenses for model usage

As the title says, we (the Italian community) have a problem.
Right now most of the datasets come from the academic world; they don’t have any license but require a citation of the paper.
So for the Italian model https://github.com/MozillaItalia/DeepSpeech-Italian-Model/ we are avoiding them because we don’t know how to deal with them.

The post at https://hacks.mozilla.org/2019/12/deepspeech-0-6-mozillas-speech-to-text-engine/ mentions two academic datasets that have the same issue: no license, but a citation is required.

So my question is: can we use them and release a public domain model? Or do we need to mention that we are using them, and do the users of the model need to mention it too?
We have the same problem with audio+text and text-only datasets, and also with using CC-licensed data (including non-commercial) to generate a model.

I also started a discussion in Italian on Reddit https://www.reddit.com/r/ItalyInformatica/comments/e6ffyg/licenze_open_source_e_paper_accademici/ to understand the problem better.

If we can use this material and license the model as public domain, even though it is generated from resources from different sources with different licenses, that will change our project, because we would no longer have any limits.
The point we raised is whether using material licensed in one way to release something that processes that material (or maybe just a part of it) creates an issue for the whole project.

Mozilla’s legal team can probably help us understand this, including the issue that every country has different regulations…

Can you clarify what those two datasets you’re referring to are?

Fisher https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2004-fisher-corpus.pdf and Switchboard https://catalog.ldc.upenn.edu/LDC97S62 (this one has a license that I don’t think is open source)

Those datasets are proprietary, but they do have licenses, which we paid for.

So I am wondering: in the case of academic datasets that only require a citation, can we use them to release a public domain model, or do we need to mention them? After all, we are using the data to generate something else.

I would agree in principle, but I think we would have to get that verified by lawyers.

Do you have some datasets already identified?

An example in my case is http://www.mspkacorpus.it/; we have already written to their contact email, with no answer in over 10 days.
But it is just an example of how these datasets are released: no license, just a citation requirement.

If there is no license, the default is copyright, unfortunately. I was not involved in the negotiations for the datasets we paid for, so I’m unsure how that plays here.

Just an update about this discussion.

After talking with people at All Hands and also at FOSDEM, we concluded that:

  • If a text corpus includes material from a copyrighted resource, we can aggregate all of the sources and remove the sentences that are not repeated often enough, for example fewer than 10 times. This way it is fairly difficult to identify the source.
  • A dataset without a license (like our academic ones) could be used to generate a machine learning model. This is because the model doesn’t allow recreating the original files. It isn’t really a derived work but something different.
  • If data is public and unlicensed, the only thing you can safely do for sure is use it to produce a model with transfer learning that you keep to yourself and do not release.
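
For the first point, the filtering could be sketched roughly like this in Python. This is only a minimal illustration, not what any of the projects actually run: the function names, the lowercase/whitespace normalization, and the default threshold of 10 are assumptions made for the example, and it says nothing about whether the result is legally safe.

```python
from collections import Counter
from typing import Iterable, List

# Threshold taken from the "fewer than 10 times" example above (an assumption, tune as needed).
MIN_REPEATS = 10


def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies count together.

    This normalization is an assumption of the sketch, not part of the original idea.
    """
    return " ".join(sentence.split()).lower()


def filter_rare_sentences(
    corpora: Iterable[Iterable[str]], min_repeats: int = MIN_REPEATS
) -> List[str]:
    """Merge all corpora and keep only sentences seen at least `min_repeats` times."""
    counts: Counter = Counter()
    for corpus in corpora:
        for sentence in corpus:
            key = normalize(sentence)
            if key:  # skip empty lines
                counts[key] += 1
    # Each surviving sentence is emitted once, sorted for a stable output order.
    return sorted(s for s, n in counts.items() if n >= min_repeats)


if __name__ == "__main__":
    source_a = ["Ciao mondo"] * 6
    source_b = ["ciao  mondo"] * 4 + ["frase rara"] * 3
    print(filter_rare_sentences([source_a, source_b]))
```

Counting across all sources together, rather than per source, is what makes a kept sentence hard to attribute to a single origin; sentences that appear in only one place tend to fall below the threshold and are dropped.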

So I am looking for something to back up the second point, like an article somewhere on the internet, just to be sure and to have a reference for that decision.