I'm almost giving up on the project. Feedback from a big contributor (10000 sentences sent, 7000 listened)

Dear @mary thank you for your contributions.

Several of these issues are already being worked on; for example, improving the visibility of Sentence Collector (and Sentence Collector in general) is currently on the roadmap.

In terms of repetitive sentences, this is something that can be taken care of in the text corpus: if there are repetitive sentences, they can be removed. It would help if you let us know which language(s) you are contributing to.

In terms of advertising, we rely on the language communities represented on Common Voice to help us with that. With over 100 different languages participating so far, this is far more than a single project can accomplish on its own, and individual communities know their social media needs better.

We hear your frustration, and would be happy to help you come up with constructive ways to participate. Thank you again for your contributions!



Hey @mary, I’m with @daniel.abzakh on this. This is an open source project which belongs to us, the language communities, and the Mozilla Foundation is a facilitator who helps us. Being open source means we should contribute to it in any way possible.

I’m nearing two years of commitment to this project. When I first came here, I also saw many missing parts and pieces, including what you have mentioned, and some more, which were corrected along the way. You can see some of them here. You can also check the issues and feature requests on GitHub for a larger set.

The project currently has only a single engineer for the application and all the backend, and it is not easy to maintain such a big code base. In such scenarios it is not easy to completely change a feature and/or add new ones without breaking things. But I can see from the PRs on GitHub that it is continuously worked on.

I cannot say anything specific on behalf of the project or about your issues (errors, the language you’ve been contributing to [Arabic?], etc.), but I heard about the following:

  • The community manager position has been open for a while now (since July 2022) and they opened a call, which was closed after a time. I’d like to assume we will have someone to guide us for new campaigns :slight_smile:
  • There was a master’s thesis on CV UI/UX two months ago in which many of us participated. It was about Sentence Collector and its integration into the CV frontend. I hope it will be implemented so that SC comes into view.
  • Your problem with recording the same sentences lies in the fact that the text corpus is not big enough. You need to add new sentences/vocabulary as a community. If it is Arabic, you can check your detailed statistics here, or choose your language if it is not.
  • The infinite-loop error you mentioned might be the pop-up which appears while sending, right? It is a server-side error (from the database) which prevents the same person from recording the same sentence. You should cancel it; the code will not allow it. But you shouldn’t be allowed to record those sentences anyway, which is another unknown bug.

As a volunteer who knows some of this stuff, perhaps I can help if you can be more specific.


Maybe I’m not being clear enough. I don’t want to keep saying the same words! I want to speak words I’ve never spoken before in the project. It’s not just a corpus problem.

The words alone are in the public domain. And a corpus isn’t going to solve that, because I’m going to keep repeating words I’ve already spoken thousands of times or skipping until I get to different words.

Another thing that is irrational is that there is a short time limit for speaking the sentences, and sometimes the text is too long! Sometimes it is impossible to say everything in time.

The main problem is not the repeated sentences; the main problem is sending me words that I have already spoken, for me to speak again. Useless. I want to speak only words that I have never spoken before, to enrich the vocabulary of the dataset.

Increasing the text corpus is trivial, the words alone are in the public domain, it’s just literally taken from a dictionary. And even then, it wouldn’t solve the problem.

I would keep getting words that I have already spoken, that I don’t want to speak anymore, which is annoying. I want to receive only new words that I have never spoken, the project will be more useful, efficient, rational like that. Each user should have the option to receive only words that they have never spoken, to be more useful and not to get bored.

What is the point of receiving words that I have already spoken to speak again? Or keep skipping all the clips until I come up with some rare word I’ve never spoken before? God. How does the project go forward like this?

Most of the common expressions and usual words I have already spoken! Expand the corpus with words from open dictionaries and only send new words to users.

This would accelerate the development and seriousness of this project to a great extent. This should be the main focus of development.

You are free to propose new sentences. In terms of pronouncing words that you have already spoken, this actually makes sense: words are pronounced differently in context, and if you are seeing the words multiple times it may be that they appear in multiple contexts. In any case it is difficult to respond, as you have not given any concrete examples, and it is unclear which language you are working with.

A lot of work for one person. We need to augment the corpus with dictionary words, what’s the problem with doing that? There are no problems with copyright!

And context changes the pronunciation of words in expressions very little. The project needs to be improved; these flaws apply to all languages.

I have already sent emails to NVIDIA and Bill Gates’ corporation, among other partners that finance the project. This problem is very serious and it is not being addressed in the right way. Let’s wait. They need to hire more developers, if that’s the case.

It is always good to have a larger vocabulary, with those words in different sentences/contexts and spoken by different people (gender, age, accent, etc.).

But we need to be clear on some basics:

  • Common Voice is not aware of any language specifics, such as the dictionary’s universe, whether the words in a sentence have been spoken previously, statistics on that, etc.
  • What is recorded is defined by the text corpus of that language, which must be public domain/CC0 and which must be provided by the language communities.
  • It is best to have sentences spoken in the daily language, conversational ones. But as you might know, in any language only a small portion of the whole vocabulary is commonly used in everyday conversation, and the vocabulary changes over time. Some words are not used anymore (e.g. only spoken by the elderly) and some new ones appear (e.g. we didn’t use the phrases “face mask” or “blood clotting” in everyday language until we hit a pandemic).
  • It is not a good idea to dump a dictionary into the text corpus; entries should be valid sentences from conversations. There may be single-word sentences like “Tea?” (in place of “do you want some tea”), but it is better to have longer sentences. The acoustic models rely on sounds following one another, and different word combinations are good for them.
  • Different pronunciations matter. It can be accent differences (e.g. “Tea?” and “Tea…” are spoken differently and mean different things, “do you want tea?” for the first one and “I’m drinking tea…” as an answer to a question), or different pronunciations by different people from all over the world.

Now let us do some simple analysis, assuming some values and doing some calculations:

  • If we aim for a voice-AI model for a language, we will need both an acoustic model and a language model. (Note: Common Voice is here for the acoustic model; you can add a specific or broad language model for your application’s purpose, independent of the vocabulary here on CV.)
  • We aim for <10% WER (word error rate).
  • We have a language with a 100k-word dictionary, but only 50k of these words are in the CV text corpus/voice corpus…
  • With sufficient recordings, you might get WER = 50%, and with a specific language model added on top of it, you might get WER = 30%.

To improve these:

  • You would need more recordings (the duration of the training set is the most important factor, along with diversity).
  • You would need a larger text corpus to achieve that (to increase the vocabulary).
  • You need more people with different demographics (gender, age, accent) to speak these sentences.
  • You can fine-tune your language model with more data for the purpose at hand (e.g. producing subtitles for the news is different from transcribing a conference in the area of medicine).

For better domain-specific data, you would need domain-specific text corpora and recordings though, which would require a more-or-less major change in CV workflows.

An acceptable model can be achieved (say) with 100 h of validated, quality (correctness, diversity, vocabulary, etc.) recordings. Assuming an average recording is 3.6 seconds, 1 hour of data is 1000 recordings, so for 100 h you would need 100k recordings. If each sentence is spoken by two different people on average, that would mean 50k sentences in the text corpus.
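The back-of-envelope numbers above can be written out as a tiny script; the 3.6 s average clip, two speakers per sentence, and the 100 h target are the assumptions from this post, not official CV figures:

```python
# Rough sizing of a 100-hour validated dataset,
# using the assumed values from this post.
AVG_CLIP_SEC = 3.6         # assumed average recording length
SPEAKERS_PER_SENTENCE = 2  # assumed recordings per sentence
TARGET_HOURS = 100         # assumed target for an acceptable model

recordings_per_hour = 3600 / AVG_CLIP_SEC                    # 1000 clips
total_recordings = TARGET_HOURS * recordings_per_hour        # 100k clips
sentences_needed = total_recordings / SPEAKERS_PER_SENTENCE  # 50k sentences

print(int(recordings_per_hour), int(total_recordings), int(sentences_needed))
```

Change the assumptions (longer sentences, more speakers per sentence) and the required corpus size shifts accordingly.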

To get better results:
New sentences to SC (with new vocabulary) + new volunteers => new and more recordings

This is how things work.

A lot of work for one person.

Yep. You need to build a community for your language, some of them will be more interested/dedicated like you, so you might form a core group for social media, events, campaigns etc.


The main goal of a language community is the betterment of the language data (text & voice corpora) on CV, so that a better voice-AI comes out.

It is like steering. You turn left; if too much, you turn a bit right; if you’re slow, you hit the gas.

Currently, CV releases are 3 months apart. I think it is an ideal timing for this workflow:

  • A version comes out, you analyze the data (also taking the previous releases into account)
  • You find what you are lacking, or how much you improved wrt previous version.
  • Plan for the next release (campaigns etc)
  • Go to start

For the analysis part, I prepared two webapps. I think these will help with this crucial part of the workflow for all languages.

For example, the results of v12.0 for my language, Turkish, are here.

From the Text Corpus tab I can see the following:

The total token count is ~39k.
The Turkish dictionary has ~90k entries, but many of them are somewhat old (with Farsi or Arabic roots) and are not spoken much now.
On the other hand, Turkish is an agglutinative language (words are extended with suffixes), so the 39k figure does not represent only root words; it also includes plurals, etc.
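For reference, the kind of token counting behind a figure like ~39k can be sketched as follows; the tokenizer here is deliberately naive (lowercase, runs of word characters) and the three sentences are made up:

```python
from collections import Counter
import re

# Toy stand-in for a language's text corpus; a real analysis
# would stream the full validated-sentences file instead.
sentences = [
    "Çay ister misin?",
    "Çay içiyorum.",
    "Yarın görüşürüz.",
]

def tokens(text):
    # Naive tokenizer: lowercase, keep runs of word characters.
    return re.findall(r"\w+", text.lower())

counts = Counter()
for sentence in sentences:
    counts.update(tokens(sentence))

print(len(counts))    # number of distinct tokens (types)
print(counts["çay"])  # frequency of one token
```

For an agglutinative language you would additionally need stemming to separate root words from their inflected forms, which this sketch does not attempt.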

So I need to work on vocabulary and extend the text-corpus.

If we look at the character or word/sentence distribution:

I see that many sentences are short. We added many short conversational sentences; this is why.

So we need to work for longer sentences.

I added all the books from a famous writer, which became public domain, so a fair amount of rather old vocabulary is also included.

So I need to find CC0 sources for new text-corpus, which are longer sentences and include new vocabulary.

Whenever I find a resource, I analyze it against the current corpora (with offline scripts) and calculate the amount of new vocabulary. I also have similarity filters, with which I filter out very similar sentences.

After collecting a couple of these, I merge them and shuffle them before posting them to Sentence Collector for validation, which is done by the community. From each book, I can get some 3-5k sentences and a few hundred new vocabulary words.
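A minimal sketch of this kind of offline filtering; the similarity threshold, the whitespace tokenization, and the `filter_candidates` helper are all illustrative choices of mine, not the actual scripts:

```python
from difflib import SequenceMatcher

def filter_candidates(existing, candidates, threshold=0.85):
    """Drop candidates too similar to known sentences; report new vocabulary."""
    kept = []
    for cand in candidates:
        too_similar = any(
            SequenceMatcher(None, cand.lower(), known.lower()).ratio() >= threshold
            for known in existing + kept
        )
        if not too_similar:
            kept.append(cand)
    # Naive whitespace vocabulary comparison (punctuation left attached).
    known_vocab = {w for s in existing for w in s.lower().split()}
    new_vocab = {w for s in kept for w in s.lower().split()} - known_vocab
    return kept, new_vocab

existing = ["Do you want some tea?"]
candidates = ["Do you want some tea!", "The weather is lovely today."]
kept, new_vocab = filter_candidates(existing, candidates)
print(kept)       # the near-duplicate is filtered out
print(new_vocab)  # vocabulary the kept sentences would add
```

The pairwise comparison is quadratic, so for large corpora you would bucket or hash sentences first, but the idea is the same: only keep candidates that add variety and new vocabulary.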

I hope this example helps with your language.

There are open dictionaries of almost all languages, why not just take from them and make random sentences?

Are there any words that are not in the public domain? Why do they need to be sentences and not words?

The collection of individual words is also key. That should be an option! And no repetitions…

The text-corpus should be done according to the demands of the project, by an artificial intelligence. People keep putting the same phrases as always, I’m going crazy already. And I’m not going to waste time changing this myself, it’s impossible.

As long as the text-corpus is fed only by people, and as long as there is no option to dictate individual words, without repetitions, I see no point in continuing to strive.

We may not see the utility of adding this in the short term, but it will certainly make a lot of difference in the long run. It will result in a great wealth of vocabulary and dialects.

Please take this into consideration.

I will no longer waste time adding different words in the sentence collector and I will no longer be talking and repeating the same sentences and words. I want to receive new sentences, words I’ve never spoken. That makes me feel useful.

There are already some topics where these points are discussed. The first one was asked by me, for example.

As you will see, these are bad practices: dumping the vocabulary, auto-generating sentences by concatenating words, using a bad AI to generate them, etc.

Also, if you use an AI text generator (such as ChatGPT) you must be sure of the copyright of the output; it MUST be CC0/public domain. GPL etc. is a no-go.

The problem is: if you do not add a quality text corpus, the quality of the dataset will drop, and there is usually no going back. Whenever some text is recorded as audio, it sticks in the dataset and can only be removed with post-processing before training.

That makes me feel useful.

It is OK (and advisable) to write your own sentences looking at a dictionary though.

Please search for more, there are many…

The quality of the text-corpus is already very low by human action, believe me. Many grammatical errors and repeated sentences.

Too manual, too tiring, impossible. The text-corpus should be more complete, it should use more isolated and varied words, automatically. Not only by humans.

In addition, there should be the option for us to speak only what we have never spoken before. I repeat that need.

And I want to understand why I can’t validate the audio clips. Just leaving the account. And sometimes it says the clips are over, even though there is a considerable gap between the uploaded and validated clips.

Nothing makes sense. You need to automate these things and make the project smarter. Demand more developers; I have already done my part and contacted the financiers of the project.

We should not struggle for a problem that is clearly from the project code.

I’m not blaming anyone specifically, but Mozilla.

Writing your own sentences is not the only method, I mentioned it because you seem to be picky about the vocabulary. Some bulk methods include:

  • Books that go to the public domain
  • Your own chats
  • Open a chat room and converse with the community, respecting Sentence Collector rules
  • Use of Wikipedia data through cv-sentence-extractor

Could you please share with us which language we are talking about?

PS: I repeat, I’m also a volunteer like you. And it is not related to coding (only). What you are proposing is a remake of the whole system.

Yes, but we are talking about different things. It remains manual, arduous and inefficient work.

I mentioned this in Book-reading mode (aka "ordered sentences collections") and I still consider this the future of any sentence-collection tool:

Collect the sentences from users speaking up what they are currently reading in their browser (with a dynamic highlighting system part of a browser extension)