So you suggest training a language model on all the prompts we already have, then accepting or rejecting each new prompt (a sentence accepted by the sentence collector but not yet read by users) based on its perplexity?
Right, we could either use it with a threshold to accept or reject sentences automatically, or just as a soft guide that shows users which sentences or words might be repetitive.
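Something like this, as a rough sketch. The threshold value, the function names, and the `perplexity` callable are all illustrative assumptions, not part of any existing Sentence Collector API:

```python
from typing import Callable

# Rough sketch only: threshold and names are assumptions for illustration.
PPL_THRESHOLD = 120.0  # would need tuning against real review data

def triage(sentence: str, perplexity: Callable[[str], float],
           hard_filter: bool = False) -> dict:
    """Score one candidate sentence against the trained language model."""
    ppl = perplexity(sentence)
    result = {
        "sentence": sentence,
        "perplexity": ppl,
        # Soft mode: flag sentences the model finds too predictable,
        # i.e. too similar to the corpus it was trained on.
        "flag": "possibly repetitive" if ppl < PPL_THRESHOLD else None,
    }
    if hard_filter:
        # Hard mode: auto-reject anything below the threshold.
        result["accepted"] = ppl >= PPL_THRESHOLD
    return result
```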
When a language starts out, it launches with only 5k sentences. Training a language model on those 5k sentences and then accepting or rejecting new sentences based on that model will heavily bias new sentences toward being just like the initial 5k.
That’s the opposite of how it would work. Sentences similar to the training data get low perplexity, and low perplexity is what gets flagged or rejected, so no matter how small or large the data set, it will guide new sentences to be different from the initial ones, not just like them.
Let’s say we use a 2-gram language model, for example. Take @dabinat’s problem sentence: if we train the model on 10 sentences with the “All songs…” form plus 5 other sentences, then the 2-grams “All songs”, “songs written”, “written by”, “except as”, etc. will each have a count of 10, while all other 2-grams in the language will have small or zero counts, so the model will assign a much lower perplexity to sentences with this pattern.
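Here’s a toy version of that in Python, with made-up stand-in sentences, to show the effect concretely (add-one smoothing is my own choice here just to keep unseen 2-grams from zeroing out the probability):

```python
import math
from collections import Counter

def train_bigram_counts(sentences):
    """Count 2-grams and their left contexts over a small corpus."""
    bigrams = Counter()
    unigrams = Counter()
    for s in sentences:
        tokens = ["<s>"] + s.lower().split() + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return bigrams, unigrams

def perplexity(sentence, bigrams, unigrams, vocab_size):
    """Perplexity under an add-one-smoothed 2-gram model."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    log_prob = 0.0
    n = 0
    for prev, cur in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)

# Toy corpus: 10 sentences with the repeated "All songs written by ..."
# pattern plus 5 varied ones (illustrative stand-ins, not real data).
repeated = [f"All songs written by author {i} except as noted" for i in range(10)]
varied = [
    "The weather turned cold overnight",
    "She plays the violin beautifully",
    "Bring me a cup of coffee please",
    "The train departs at noon",
    "He painted the fence last summer",
]
bigrams, unigrams = train_bigram_counts(repeated + varied)
vocab = len({t for pair in bigrams for t in pair})

print(perplexity("All songs written by someone except as noted", bigrams, unigrams, vocab))
print(perplexity("A quiet morning walk by the river", bigrams, unigrams, vocab))
# The first (pattern-matching) sentence scores far lower perplexity than
# the second, so a low-perplexity filter would flag it as repetitive.
```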
Large, unbiased text corpora are needed when training language models for other purposes, but you can actually train a language model on any amount of text, and this use case has no such requirement.