Grammatically poor sample sentences

Here’s a question: at what point is it no longer useful to get additional recordings of a sentence? I know DeepSpeech only uses one recording per sentence. Even if someone else can make use of additional copies, there must surely be a point of diminishing returns.

So essentially my question is: is it worth even bothering to import the old sentences that have existed for years and been recorded countless times already? Maybe they should just be retired once Sentence Collector has amassed a certain number of new sentences.

Here’s the situation I encountered while reviewing Greek sentences: the ones in there appear to be fragments from a number of books. I couldn’t identify the source of the ones I got just now (“Ας πούμε το άλφα είναι το πιο τυχερό”, “Μπορεί να ζευγαρώσει με ένα σωρό άλλα γραμματάκια”, “και σύμφωνα και φωνήεντα”), but it’s obvious that they are a single sentence from some existing text, cut into three pieces at punctuation boundaries. This produces fragments that are grammatically correct but incomplete, and certainly not sentences that would appear in a regular speech corpus.

I found the source of a couple of other sentences I got: they come from the book “Παραμύθι χωρίς όνομα”, which seems to be in the public domain (the author died in 1941), but it suffers from the same problem.

I’m not sure it makes sense to continue reviewing Greek sentences if the whole set is like this… I could just downvote everything that’s not a complete sentence, but even those seem ill-suited to the purpose, and if we can’t identify the source, there could be copyright problems with them.

Might some “quick and dirty” automated screening be possible by running the kind of language models DeepSpeech uses over the submitted sentences?

KenLM and similar tools work by giving you the probability of a sentence. If the model is trained on enough data to be representative, it should estimate that probability reasonably accurately, and in turn you might find a threshold below which sentences are either excluded or flagged to the user.

Typically they give sentences with spelling mistakes much lower scores. There is some risk of legitimate sentences being caught, but perhaps at the margins that’s not such a big deal (since the goal is a dataset for pronunciation, not coverage of every possible phrase).
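To make the threshold idea concrete, here’s a minimal sketch. It uses a toy bigram model with add-one smoothing as a stand-in for KenLM (in practice you would load a real model, e.g. `kenlm.Model("lm.binary")`, and call `model.score(sentence)`); the tiny corpus and the cutoff value are illustrative assumptions only.

```python
import math
from collections import Counter

# Toy stand-in for a KenLM model: a bigram LM with add-one smoothing,
# trained on a tiny in-memory corpus (an illustrative assumption).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat and a dog sat together",
]

unigrams, bigrams = Counter(), Counter()
for line in corpus:
    words = line.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))
V = len(unigrams)  # vocabulary size, used for smoothing

def log_prob(sentence):
    """Length-normalised log10 probability under the toy bigram model."""
    words = sentence.split()
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        # Laplace smoothing keeps unseen bigrams at a small non-zero
        # probability instead of zero.
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + V)
        total += math.log10(p)
    return total / max(len(words) - 1, 1)

# Sentences scoring below the cutoff would be flagged for review.
THRESHOLD = -1.0  # arbitrary illustration; tuned per language in practice
for s in ["the cat sat on the mat", "mat the on sat cat the"]:
    verdict = "ok" if log_prob(s) >= THRESHOLD else "flag"
    print(f"{log_prob(s):6.2f} {verdict}: {s}")
```

Normalising by length matters because raw log probabilities always drop as sentences get longer, which would otherwise bias the filter against long but perfectly well-formed sentences.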

These models tend to be quick to run (for scoring, not for training), so I’m hoping this could be tacked on without hurting the user experience. And whilst I agree something sophisticated like @jf99 envisages would be great, with the resources / time available, “good enough” may be more practical :slight_smile:

Finally, I see some LMs also let you inspect the probabilities of individual words within a sentence, as shown here: https://colinmorris.github.io/lm-sentences/#/, so perhaps a heuristic could be used to filter (and this may provide a cheap way to flag the problem word(s) to users in a simple manner).
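A sketch of the per-word idea, again using a toy bigram model as a stand-in (KenLM itself exposes per-word scores via `model.full_scores(sentence)`); the corpus and the helper names here are illustrative assumptions.

```python
import math
from collections import Counter

# Toy bigram model trained on a tiny in-memory corpus, standing in for a
# real LM's per-word scoring (an illustrative assumption).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]

unigrams, bigrams = Counter(), Counter()
for line in corpus:
    words = line.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))
V = len(unigrams)

def word_scores(sentence):
    """log10 P(word | previous word) for each word after the first."""
    words = sentence.split()
    scores = []
    for prev, cur in zip(words, words[1:]):
        # Add-one smoothing, as in the sentence-level sketch.
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + V)
        scores.append((cur, math.log10(p)))
    return scores

def worst_word(sentence):
    """Return the in-context word with the lowest probability."""
    return min(word_scores(sentence), key=lambda pair: pair[1])[0]
```

With a real model, the word returned here is what you would highlight to the submitter, e.g. a typo like “mta” in “the cat sat on the mta” scores far below its neighbours, so `worst_word` picks it out.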

As we have just announced today, in the coming weeks we will ask for help to improve how we extract valid sentences from large sources of data.

Anything that improves our algorithms is more than welcome :slight_smile:

Cheers