Q: Semi-synthetic sentences

Is it possible to “generate” our own sentences and add them to the sentence collector? Such as:

Today I will go to London.
Yesterday Bill went to London.
Alicia and Allen are in Amsterdam.
etc

I’m working on Turkish and the data should contain common person names, city names, country names etc. These will be semi-synthetic but I cannot think of any other method to get the data in. And I promise they will be CC0 :slight_smile:

If this is possible, are there any guidelines for this? How many times should each of these names be repeated to be enough?

I would generally say not to do that through the Sentence Collector. This would get quite boring to review, and synthetic sentences could probably be reviewed with a different process. For example, if these could be fully generated by a script, we could review the code and a sample of sentences instead of all sentences.
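To make the "review the code and a sample" idea concrete, here is a minimal sketch of a template-based generator. The templates, name list, city list and function names are all hypothetical placeholders chosen for illustration; nothing here is an existing Common Voice tool.

```python
import itertools
import random

# Hypothetical templates and word lists, for illustration only.
TEMPLATES = [
    "Today {name} will go to {city}.",
    "Yesterday {name} went to {city}.",
    "{name} is in {city}.",
]
NAMES = ["Bill", "Alicia", "Allen"]
CITIES = ["London", "Amsterdam"]


def generate_sentences(templates, names, cities):
    """Fill every template with every (name, city) combination."""
    return [
        template.format(name=name, city=city)
        for template, name, city in itertools.product(templates, names, cities)
    ]


def review_sample(sentences, k, seed=0):
    """Draw a reproducible random sample for human review."""
    rng = random.Random(seed)
    return rng.sample(sentences, min(k, len(sentences)))


if __name__ == "__main__":
    sentences = generate_sentences(TEMPLATES, NAMES, CITIES)
    for sentence in review_sample(sentences, 5):
        print(sentence)
```

Reviewers would then only need to read the templates, the word lists and the sampled output, rather than every generated sentence.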

I have no idea whether such sentences (and how many of them) are actually useful. I’m sure @ftyers can say something about that.

I also thought about that, but the ones above are just simple examples. Adding some 5-10% to the existing sentences would be more than enough and I think it will not be boring. If I only knew the actual amount :slight_smile:

I see that we have new sentences to be recorded, probably from Wikipedia. There are quite a few technical sentences, such as polymer structures or domain names (domain dot com), also with many words/places from English and other languages. I think we need to balance these…

Any alternative method / suggestion is also welcome.

In general I think it’s a bad idea to generate sentences. As for avoiding sentences with too many domain names or difficult/English words, I think a frequency filter is a good idea. For example, avoid sentences that contain words below a certain frequency (maybe start with 5).
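A frequency filter of this kind could be sketched roughly as follows. This assumes word counts are taken from some reference corpus of sentences; the function names and the simple regex tokenizer are illustrative assumptions, and `str.lower()` does not handle the Turkish dotted/dotless i correctly, so a real filter would need language-aware casing.

```python
import re
from collections import Counter

# Simplistic tokenizer: runs of word characters (assumption, not language-aware).
WORD_RE = re.compile(r"\w+")


def tokenize(sentence):
    """Lower-case a sentence and split it into word tokens."""
    return WORD_RE.findall(sentence.lower())


def word_frequencies(sentences):
    """Count how often each word occurs in a reference corpus."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    return counts


def filter_by_frequency(sentences, counts, min_freq=5):
    """Keep sentences whose every word occurs at least min_freq times."""
    kept = []
    for sentence in sentences:
        tokens = tokenize(sentence)
        if tokens and all(counts[token] >= min_freq for token in tokens):
            kept.append(sentence)
    return kept
```

The threshold of 5 is only the suggested starting point; it would need tuning against the size of the reference corpus.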

One possible use of a sentence generator would be to mine for examples in a corpus: generate the sentences and then see if you can find them in the corpus. It should also be possible to find highly frequent sentences which are not copyrightable. For example, I do not believe that Ben içmeye gidiyorum. is a copyrightable sentence.
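As a rough sketch of the mining idea, generated candidates can be checked against the lines of a corpus after light normalization. The helper names and the normalization rules (trimming, dropping final punctuation, case-folding) are assumptions for illustration only.

```python
def normalize(sentence):
    """Trim, drop final punctuation, case-fold and collapse whitespace."""
    text = sentence.strip().rstrip(".!?")
    return " ".join(text.casefold().split())


def attested(candidates, corpus_lines):
    """Return the generated candidates that also occur in the corpus."""
    seen = {normalize(line) for line in corpus_lines}
    return [candidate for candidate in candidates if normalize(candidate) in seen]
```

Only the candidates that come back from `attested` would be kept, so the generator acts as a search query rather than as the source of the sentences.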

Cheers Francis :slight_smile: (as of tomorrow we go into full lockdown, so I cannot go anywhere; I bought a lot of bottles today, though)…

So, as far as I understand, instead of using a generator, I can be a “writer”: write regular sentences which might come up in everyday talk and inject the words I think are missing from the dataset, such as common proper names. And repeat them at least 5 times so that they are recorded multiple times by multiple people and not filtered out by some random algorithm…

That’s a lot of work… If I were a blog writer I could just donate my entries, but alas…

Another thing you could do is look at your chat logs. One could imagine using a large web corpus to extract very frequent sentences (let’s say with a frequency over 100). These would be candidates for being public domain, as, taking Turkish as an example, individual sentences out of context like:

100	Bizim için yaptım.
100	Bizim de var.
100	Bizim davamızı düşürecekti.
100	Bizi çok korkuttun.
100	Bize ne getirdin?
100	Biz aynı taraftayız.
100	Bir sinyal alıyoruz.
100	Bir şey yapmamız lazım.
100	Bir şey söylemem lazım.
100	Birşey görüyor musun?
100	Bir saniye bekleyin lütfen.

cannot be copyrighted. But to be triple sure, you could take the set of sentences that you find in this way, and then intersect that with the set of sentences from your chat logs. This way you can legitimately claim authorship without needing a blog.
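The intersection step could look something like the sketch below. It assumes the web corpus and the chat logs are both available as one sentence per line; the function names and the exact-match comparison are illustrative assumptions, and the threshold is treated here as "at least `min_count` occurrences".

```python
from collections import Counter


def frequent_sentences(corpus_lines, min_count=100):
    """Return the set of sentences seen at least min_count times."""
    counts = Counter(line.strip() for line in corpus_lines if line.strip())
    return {sentence for sentence, n in counts.items() if n >= min_count}


def shareable_sentences(corpus_lines, chat_lines, min_count=100):
    """Intersect frequent corpus sentences with sentences from one's own chats."""
    frequent = frequent_sentences(corpus_lines, min_count)
    own = {line.strip() for line in chat_lines if line.strip()}
    return sorted(frequent & own)
```

Anything that appears only in the chat logs, or only rarely in the corpus, is dropped, which is what keeps the shared portion generic.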

If you have 20 years or more of chat logs, like me, then you could end up with a lot of sentences! :slight_smile: Unfortunately mine are mostly in English, Spanish or Catalan, so not really ones that can be of use.

This is a great idea! I can start collecting them from current/work-related ones, and I can also inject some proper names into them… (I’m afraid most of my older chats are with my exes, not very suitable :stuck_out_tongue:)

What is the correct way of getting this kind of data into the system? Sentence Collector?

I’m afraid there is no interest in Sentence Collector verification. I have 3800 sentences waiting there for a second thumbs-up. I wrote about it in a post a month ago:

Well, this is exactly the point: you don’t include your old chats in their entirety. You use this list to include only those sentences from your old chats that are sufficiently generic. It’s a way of sharing only the generic/anonymised portion of your chats.

Regarding the Sentence Collector, we need to discuss it further.