How unique should a sentence be?

There are lots of sentences like the following in the Wikipedia dump:

All songs written by Trey Anastasio, except as noted.

The artist name in each sentence is unique, but all surrounding words are either almost or exactly identical each time.

While the sentences are technically unique, does the nearly identical wording bias the model too heavily, meaning such sentences should be removed?

cc @kdavis

Good question, and as usual for complicated questions the answer is “it depends”.

As Common Voice is not application specific (not geared toward radio, assistants, or any other particular use case), the sentences selected should reflect the distribution of sentences one sees “in the wild” for STT applications. However, no one knows this distribution. So the question is what to do?

The path we’re taking is to collect sentences whose distribution is as uniform as possible. Operationally we’re implementing this as “no repeats”. However, as you mention, there is a gray area.

If we remove most of your “All songs written by…” sentences, we remove a bias from any acoustic model created from the data. However, if we remove most of those sentences, we also remove a large source of speech-to-text data for proper nouns. Maybe one is better than the other? I’m not sure.

However, I am sure that if we introduce such “almost repeat” rules, we’ll be going down a rabbit hole very soon. There are 104 languages currently being worked on for Common Voice. Mozilla and the community don’t have the resources to write these “almost repeat” rules for each of these 104 languages. So I’d suggest simply allowing these “almost repeats” for now.


A common approach for this would be to train a language model on the prompts, and then measure the perplexity on a held-out test set. The perplexity would indicate how predictable the test set is. It can also be evaluated on individual sentences. Maybe this metric would be good to have when evaluating potential prompt sentences. It should scale fairly easily to the 104 languages.
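To make the idea concrete, below is a minimal sketch of what such a scoring step could look like: a bigram language model with add-one smoothing, trained purely on the existing prompts, plus a per-sentence perplexity function. The tokenization, smoothing choice, and example sentences are illustrative assumptions, not part of any existing Common Voice tooling.

```python
# Minimal sketch (not Common Voice's actual tooling): a bigram language
# model with add-one smoothing, used to score how predictable a prompt is
# relative to the rest of the prompt set.
import math
from collections import Counter


def tokenize(sentence):
    # Very naive tokenization; a real pipeline would normalize punctuation.
    return ["<s>"] + sentence.lower().split() + ["</s>"]


def train_bigram_counts(sentences):
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = tokenize(s)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def perplexity(sentence, unigrams, bigrams):
    tokens = tokenize(sentence)
    vocab_size = len(unigrams)
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Add-one smoothing so unseen bigrams still get a small probability.
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tokens) - 1))


if __name__ == "__main__":
    prompts = [
        "All songs written by Trey Anastasio, except as noted.",
        "All songs written by Neil Young, except as noted.",
        "The weather station opened in 1923.",
    ]
    unigrams, bigrams = train_bigram_counts(prompts)
    for candidate in [
        "All songs written by Joni Mitchell, except as noted.",
        "The committee voted to adjourn the meeting.",
    ]:
        print(round(perplexity(candidate, unigrams, bigrams), 1), candidate)
```

A sentence that reuses a frequent template scores a noticeably lower perplexity than a genuinely novel one, which is the signal a reviewer (or an automated filter) could act on.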

Yeah, I guess we should avoid pre-filtering stuff until we know it’s actually a problem.

A good idea, but I worry about obtaining large enough unbiased text corpora to create accurate language models to judge new sentences with.

For example, the problem we are talking about is in Wikipedia data. So if we exclude Wikipedia from our language model, we will have a hard time finding enough text for under-resourced languages like Hakha Chin.

Do you have an idea where we can obtain large, unbiased text corpora for all 104 languages that are currently being worked on?

I’m talking about training a model only on the prompts. The purpose is exclusively to judge the perplexity of individual prompts in relation to the whole, so we don’t want other text.

So you suggest training a language model on all the prompts we already have, then accepting/rejecting each new prompt (a sentence accepted by the sentence collector but not yet read by users) based upon its perplexity?

I think this can have the same problem I was talking about: obtaining a large, unbiased data set is hard. In more detail…

When a language starts, it launches with only 5k sentences. Training a language model on these 5k sentences, then accepting/rejecting new sentences based on this language model, will highly bias new sentences to be just like the initial 5k, and it’s doubtful that these 5k sentences will reflect the full diversity of a language. In other words, 5k is not enough sentences to get a good model of a language.

However, for languages in which we have lots of sentences collected already, I think your idea would work. But I guess the question then is: what does “lots of sentences” mean? In other words, how many sentences are needed before we can switch from hand-curated to automated acceptance?

So you suggest training a language model on all the prompts we already have, then accepting/rejecting each new prompt (a sentence accepted by the sentence collector but not yet read by users) based upon its perplexity?

Right, we could either use it with a threshold to accept/reject, or just as a soft guide in showing users which sentences or words might be repetitive.
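For instance, a hypothetical wrapper over such a perplexity score might look like the following; the function name, threshold value, and return format are made up for the sketch.

```python
# A hypothetical wrapper over a perplexity score; review_prompt, the
# threshold value, and the return format are assumptions for this sketch,
# not part of any existing Common Voice tool.

def review_prompt(sentence, score_fn, threshold=20.0):
    """Return (accept, note) for a candidate prompt.

    score_fn maps a sentence to its perplexity under a language model
    trained on the prompts collected so far (e.g. the bigram sketch above).
    """
    perplexity = score_fn(sentence)
    if perplexity < threshold:
        # A low perplexity means the model finds the sentence very
        # predictable, i.e. it looks like an "almost repeat".
        return False, f"possibly repetitive (perplexity {perplexity:.1f})"
    return True, f"ok (perplexity {perplexity:.1f})"
```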

When a language starts, it launches with only 5k sentences. Training a language model on these 5k sentences, then accepting/rejecting new sentences based on this language model, will highly bias new sentences to be just like the initial 5k

That’s the opposite of how it would work. No matter how small or large the data set, it will guide new sentences to be different from the initial ones, not just like them.

Let’s say we use a 2-gram language model, for example. Going back to @dabinat’s problem sentence: if we train a language model on 10 sentences with the “All songs…” form and 5 other sentences, then the 2-grams “All songs”, “songs written”, “written by”, “except as”, etc. will each have counts of 10, while all other 2-grams in the language will have small or zero counts, and the model will therefore assign a much lower perplexity to sentences with this pattern.
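Here is a toy version of that 10-plus-5 scenario; the artist names and the five filler sentences are invented for the illustration.

```python
# Toy version of the 10-plus-5 scenario above. The artist names and the
# five filler sentences are invented for the illustration.
from collections import Counter

pattern = "All songs written by {}, except as noted."
artists = ["Trey Anastasio", "Neil Young", "Joni Mitchell", "Carole King",
           "Bob Dylan", "Nina Simone", "Tom Waits", "Kate Bush",
           "Leonard Cohen", "Patti Smith"]
prompts = [pattern.format(a) for a in artists] + [
    "The bridge was rebuilt after the flood.",
    "She studied volcanic rock formations in Iceland.",
    "Parliament adjourned without passing the bill.",
    "The recipe calls for two cups of flour.",
    "He repaired the old clock tower by hand.",
]

bigrams = Counter()
for s in prompts:
    tokens = ["<s>"] + s.lower().split() + ["</s>"]
    bigrams.update(zip(tokens, tokens[1:]))

# The template's 2-grams dominate the counts, so a new sentence that reuses
# the template will look extremely predictable (i.e. get a low perplexity).
print(bigrams[("all", "songs")])      # 10
print(bigrams[("songs", "written")])  # 10
print(bigrams[("the", "bridge")])     # 1
```

Because the template’s 2-grams carry most of the probability mass, any new “All songs…” sentence looks extremely predictable to this model, which is exactly what would push its perplexity down.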

Large, unbiased text corpora are needed when a language model is trained for other purposes, but you can actually train a language model on any amount of text, and this purpose does not have such a requirement.