As the title says, we have a problem (for Italian).
Right now most of the available datasets come from the academic world: they don’t have any license, but they require citing the paper.
So for the Italian model https://github.com/MozillaItalia/DeepSpeech-Italian-Model/ we are avoiding them because we don’t know how to deal with them.
So my question is: can we use them and release a public domain model? Or do we need to mention that we are using them, and do the users of the model need to mention it as well?
We have the same problem with audio+text and text-only datasets, and also with using CC-licensed material (including non-commercial) to generate a model.
If we can use this material and license the model as public domain, even though it was generated from resources with different sources and different licenses, that will change our project, because we will not have any limits.
The point we raised is whether using material licensed in one way and releasing something that processes that material (or maybe just a part of it) creates an issue for the whole project.
Mozilla’s legal team can probably help us understand this, including the issue that every country has different regulations…
So I am wondering: in the case of an academic dataset that only asks for a citation, can we use it to release a public domain model, or do we need to mention it? After all, we are using that data to generate something else.
lissyx
I would agree in principle, but I think we would have to get that verified by lawyers.
An example in my case is http://www.mspkacorpus.it/: we have already written to the email addresses listed there, with no answer in over 10 days.
But it is just an example of how these datasets are released: no license, just a citation requirement.
lissyx
If there is no license, the default is copyright, unfortunately. I was not involved in the negotiations for the datasets we paid for, so I’m unsure how that plays out here.
Talking with people at All Hands and also at FOSDEM, we established that:
If a text corpus includes material from a copyrighted resource, we can aggregate all the corpora and remove the sentences that are repeated fewer than about 10 times. This way it is rather difficult to identify the source.
A dataset without a license (like our academic ones) could be used to generate a machine learning model, because the model does not allow the original files to be recreated. It is not like a derived work but something different.
If data is public and unlicensed, the only thing you can safely do for sure is use it to produce a model with transfer learning that you keep to yourself and do not release.
So I am looking for something to back up the second point, like an article somewhere on the internet, just to be sure and to have a reference for that decision.
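For the first point (aggregating text corpora and dropping sentences that appear fewer than about 10 times), here is a minimal sketch of how the filter could look. The function name `filter_frequent_sentences`, the `min_count` parameter, and the normalization step (lowercasing and collapsing whitespace) are assumptions made for illustration, not something agreed in the discussion, and the sketch only covers the technical side, not whether the approach is legally sufficient.

```python
from collections import Counter
from typing import Iterable, List


def filter_frequent_sentences(
    corpora: Iterable[Iterable[str]], min_count: int = 10
) -> List[str]:
    """Aggregate several corpora and keep only sentences seen at least min_count times."""
    counts: Counter = Counter()
    for corpus in corpora:
        for sentence in corpus:
            # Assumed normalization: collapse whitespace and lowercase,
            # so trivial variants of a sentence are counted together.
            normalized = " ".join(sentence.split()).lower()
            if normalized:
                counts[normalized] += 1
    # Rare sentences are dropped; the survivors are returned in sorted order.
    return sorted(s for s, n in counts.items() if n >= min_count)


if __name__ == "__main__":
    corpus_a = ["Buongiorno a tutti"] * 6 + ["Una frase rara"]
    corpus_b = ["buongiorno  a tutti"] * 5
    # "buongiorno a tutti" occurs 11 times in total, "una frase rara" only once.
    print(filter_frequent_sentences([corpus_a, corpus_b]))
```

Counting happens over the aggregate of all corpora, so a sentence only survives if it is common overall, which is what makes it hard to attribute to a single source.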