Generate sentences using ML

FremyCompany · December 11, 2019, 1:09pm

Question here. I have a language model trained on sources which themselves are not CC0 but are suitable for research (for instance Wikipedia, EuroParl, etc.); there are enough of them that the resulting language model can’t possibly overfit these sources. If I use that language model to generate sentences, would I be able to contribute those sentences as CC0, provided I make sure they are not part of any of the datasets I used to train the language model in the first place (that is, they are novel sentences)?

In essence this doesn’t seem different from me, a human, reading a lot of things and then writing new sentences and releasing them as CC0 because I’m the author, even though I will of course unconsciously draw inspiration from the sentences I have seen in my life up to that point. But I don’t know what the legal standing is in this case.

nukeador · December 11, 2019, 1:11pm

Hi and welcome to the community!

Moving your message to a new topic.

If I understand correctly, you have a model which is able to write its own original sentences, which do not appear in the datasets it was trained on, and you want to release these sentences under CC0.

Is that correct? Do you have a small sample of the generated sentences?

This looks like a super interesting thing to test and we can consult with our legal team about it, but it sounds promising.

Cheers.

FremyCompany · December 11, 2019, 1:38pm

Hi! To be more precise, I am a doctoral student working on NLP (in Dutch in particular) at the University of Ghent (UGent).

One of my projects is to build a language model for (Flemish) Dutch. As a result, I have over the past months collected a lot of resources (of various origins, but all suitable for research; exact licensing is usually possible to track down though not always readily available for all sources). I can therefore use these sources to train language models.

I don’t have a language model right now that is tailored to generating Common Voice-style sentences, but I have enough data and tooling to build one if that would be useful.

Running that language model to generate new sentences shouldn’t be a problem, but of course these sentences would then need to be manually reviewed for language accuracy, though I would expect the kind of short sentence needed for Common Voice to be almost all correct grammatically (longer sentences usually have a bigger chance to be incorrect than short ones).

That said, I would be interested to know which licenses (for the training text) would allow me to train a language model and then contribute its sentences as CC0. I would naively assume anything that can be used for research purposes is fair game if the sentences subsequently generated do not match the training set, but this is a question we really need a legal team to look at. Technically, Wikipedia is under a Share-Alike license, but would sentences generated by a language model trained on Wikipedia have to be Share-Alike too? The model probably would if it were released, but would randomly generated sentences? That’s my question, I guess.
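For the “do not match the training set” part, here is a minimal Python sketch of how such a filter could look. It is only an illustration under my own assumptions, not a tool I have actually run on my data: the function names (`normalize`, `filter_novel`), the 4-gram window and the 0.5 overlap threshold are all hypothetical choices. It drops generated sentences that are identical to a training sentence after normalization, and also those that share most of their word 4-grams with the training text.

```python
import re
import unicodedata


def normalize(sentence: str) -> str:
    """Lowercase, drop punctuation and collapse whitespace so trivial variants compare equal."""
    text = unicodedata.normalize("NFKC", sentence).lower()
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())


def ngrams(tokens, n):
    """Return the set of word n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def filter_novel(generated, training, n=4, max_overlap=0.5):
    """Keep generated sentences that are neither exact (normalized) copies of a
    training sentence nor mostly made of n-grams seen in the training set.
    The values of n and max_overlap are illustrative assumptions."""
    train_norm = {normalize(s) for s in training}
    train_ngrams = set()
    for s in train_norm:
        train_ngrams |= ngrams(s.split(), n)

    novel = []
    for sentence in generated:
        norm = normalize(sentence)
        if norm in train_norm:
            continue  # exact copy of a training sentence
        grams = ngrams(norm.split(), n)
        if grams and len(grams & train_ngrams) / len(grams) > max_overlap:
            continue  # too much verbatim overlap with the training text
        novel.append(sentence)
    return novel


training = ["De kat zit op de mat.", "Het regent vandaag in Gent."]
generated = [
    "de kat zit op de mat",
    "De hond slaapt in de tuin.",
    "Het regent vandaag in Gent!",
]
print(filter_novel(generated, training))
```

A filter like this only addresses verbatim reproduction; whether the surviving sentences can be released as CC0 is still the legal question above.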

Food for thought: models like OpenAI/GPT-2 are released under the MIT license, despite being trained on randomly sampled data from articles linked on Reddit. Google/BERT is Apache 2.0. (It is my understanding that BERT is trained on a derivative of Wikipedia, but I could be wrong.)

FremyCompany · December 11, 2019, 2:05pm

Partly replying to my own question:

Once the aspect of reproduction is properly addressed, we suggest refraining from defining training models in terms of derivative/adapted works, with the consequence that licensing restrictions (e.g. all rights reserved, ND or SA) imposed on the input training resources may not find application in the resulting output. At the same time, we acknowledge that the scope of TM/NLP is too broad to be handled homogeneously and that different types of algorithms and parametrisations require dedicated legal analysis, for example based on the level of abstraction they attain over the input data and the type of original material that is reproduced in the trained model.

(source)

FremyCompany · December 11, 2019, 2:08pm

My own reading is that “if the model generates sentences that only someone who has read the original content would be able to generate, then it probably is a derivative work; but if the sentences it generates achieve a high enough level of generality, then the model is not a derivative work, so as long as it does not reproduce the original data, the license of that data doesn’t matter”.

But I am not a lawyer.

nukeador · December 11, 2019, 2:27pm

Let me bring this to our next meeting with our legal team.

Cheers.

nukeador · December 17, 2019, 4:59pm

I have a few follow-up questions:

Did you use any material for training your model that specifically forbids its use for training machine learning?
How similar are the generated sentences to the sentences you used for training? (Can you provide a few examples?)

Thanks, that would help us better analyze this issue.

FremyCompany · December 20, 2019, 10:12am

Like I said, I don’t yet have a model trained on a restricted set of data that I’m confident would be OK to use for this purpose, but I can work on building one during the holiday period; I’ll get back to you when I have one ready.