Grammatically poor sample sentences

I’m not talking about homophones. I reject recordings where someone reads “gonna” or “wanna” when they should have said “going to” or “want to”.


I’m specifically talking about issues like “here” in place of “hear”, which I have seen in the text.

Got it, well I don’t know about this one.

We now have the Sentence Collector which provides a way to submit new sentences and have them validated. However, it has only been live for the past few weeks and many of the sentences pre-date that. In fact, English currently has a big backlog so you are reviewing sentences recorded around a year ago.

I believe the eventual plan is to resubmit the existing sentences through the Sentence Collector so they can be validated and fixed as needed.

We want to run an automated clean-up on the existing sentences. We haven’t yet established a final process to solve the issue described here; the current one is to request the removal of these sentences and then resubmit the corrected versions through the Sentence Collector tool.

Any ideas to improve this process are welcome! :slight_smile:

How many are we talking about? A good proof-reader should be able to check and edit at quite a speed; much quicker than it takes to speak them.

‘Hear’ and ‘here’ might be a distracting error in print, but would not necessarily affect the spoken words. Leonardo DiCaprio ‘staring’ in a film would, however.


In my area, there’s a big difference between ‘hear’ (heer) and here (hee-ya).

Even if it doesn’t make a difference to pronunciation, sometimes speakers pause or stumble when they encounter a mistake in the text.


We need a tool that lets you

  • search in all submitted sentences of a specified language
  • submit a corrected version of that sentence
  • provide a justification for the correction (e.g. a link to a dictionary)
  • review the corrections that others made (ideally allowing a discussion between corrector and reviewer)

I have already collected dozens of mistakes in the German corpus. The longer we wait, the more of them make it into it.


I don’t expect we will be able to build a tool that does what you describe in the short term (we have other priorities and not a lot of resources). That’s why the current proposal, which at least ensures no bad sentences end up in the dataset, is what I described:

  1. Request removal from the sentences list.
  2. Correct the sentences and submit them to the Sentence Collector so they end up on the site in their correct form.

Existing sentences that can be identified automatically as “wrong” will be removed by our cleaning scripts.
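As a rough illustration of what “identified automatically” could mean, here is a minimal sketch of a rule-based check. The rule names and patterns are hypothetical, chosen only for this example; the project’s actual cleaning scripts may check entirely different things. Note that purely mechanical rules like these would not catch a “here”/“hear” mix-up.

```python
import re

# Hypothetical rules for illustration only; not the project's real cleaning rules.
RULES = {
    "repeated_word": re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE),
    "contains_digit": re.compile(r"\d"),
    "double_space": re.compile(r" {2,}"),
}


def find_problems(sentence):
    """Return the names of all rules that the sentence triggers."""
    return [name for name, pattern in RULES.items() if pattern.search(sentence)]


def split_clean_and_flagged(sentences):
    """Separate sentences into a clean list and a dict of flagged ones."""
    clean, flagged = [], {}
    for sentence in sentences:
        problems = find_problems(sentence)
        if problems:
            flagged[sentence] = problems
        else:
            clean.append(sentence)
    return clean, flagged
```

Flagged sentences could then go onto the removal list, while everything else stays in place.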

In any case your feedback is valuable and we can incorporate it into the list of things we would like to have in the future.


The lack of resources is understandable. Where and how do I request the removal of several sentences? Shall I make a PR on Github?

Yes please, thanks for your understanding :slight_smile:

Suggestion: since the volunteer readers will all be seeing these sentences and having to inspect them attentively (we hope) before reading them aloud, could we, in addition to the button saying ‘Skip’, provide a button for ‘Error in this sentence?’ That should act as a filter to collect as many as possible, narrowing the task for the second-level reviewers. It won’t be infallible, but it could be a big (and inexpensive) step in the right direction.
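To make the filtering idea concrete, here is a small sketch of how button presses might be aggregated. The function name, the `(user_id, sentence_id)` report format, and the `min_reports` cut-off are all assumptions for illustration, not part of the existing site.

```python
from collections import Counter


def sentences_needing_review(reports, min_reports=2):
    """Return sentence ids reported by at least `min_reports` distinct users.

    `reports` is an iterable of (user_id, sentence_id) pairs, one per button
    press. Results are ordered from most reported to least reported.
    """
    unique_reports = set(reports)  # count each user once per sentence
    counts = Counter(sentence_id for _user, sentence_id in unique_reports)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [sentence_id for sentence_id, n in ranked if n >= min_reports]
```

Requiring more than one distinct reporter is one cheap way to keep accidental clicks from reaching the second-level reviewers.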


Incidentally, you speak of needing to cite a reference to justify corrections. That sounds needlessly complex to me, and would be very difficult in the case of punctuation and some of the grammar errors. You’ll find plenty of people who can identify that ‘should’ needs to be replaced with ‘will’ but far fewer who will be able to identify it as an inappropriate use of the future conditional subjunctive (for example). Why do you need that anyway?

Couldn’t it just be a system where people vote on whether they agree with the correction? That’s much simpler.
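A vote-based check could be as small as the sketch below. The minimum of three votes and the two-thirds agreement rule are arbitrary assumptions made for the example, not an agreed policy.

```python
def decide(votes, min_votes=3):
    """Decide the fate of a proposed correction from a list of boolean votes.

    True means the voter agrees with the correction. Returns "pending" until
    `min_votes` votes are in, then "accepted" if at least two thirds agree,
    otherwise "rejected".
    """
    if len(votes) < min_votes:
        return "pending"
    agree = sum(votes)
    # Integer comparison avoids floating-point rounding at exactly 2/3.
    return "accepted" if agree * 3 >= 2 * len(votes) else "rejected"
```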

I suggest we don’t go too deep into solution ideation, since at the end of the day it’s not something we will be able to change today, and sometimes it tends to be an endless conversation about personal preferences.

I think it’s better to focus on describing the problem clearly so we can come back here for reference when we have time to start thinking about a proper solution :slight_smile:


Here’s a question: at what point is it no longer useful to get additional recordings of a sentence? I know DeepSpeech only uses one recording of a sentence. Even if someone else can make use of additional copies, there must surely be a point of diminishing return.

So essentially my question is: is it worth even bothering to import the old sentences that have existed for years and been recorded countless times already? Maybe they should just be retired once Sentence Collector has amassed a certain number of new sentences.

Here’s the situation I encountered while reviewing Greek sentences: it seems that the ones in there are fragments from a number of books. I couldn’t recognize the source of the ones I got just now (“Ας πούμε το άλφα είναι το πιο τυχερό”, “Μπορεί να ζευγαρώσει με ένα σωρό άλλα γραμματάκια”, “και σύμφωνα και φωνήεντα”) but it’s obvious that it’s a sentence from some existing text cut in three pieces at punctuation boundaries. This creates pieces of grammatically correct but incomplete language and, of course, not sentences that would appear in a regular speech corpus.

I found the source of a couple of other sentences I got: it’s from the book “Παραμύθι χωρίς όνομα” which seems to be in the public domain (the author died in 1941) but it suffers from the same problem.

I’m not sure it makes sense to continue reviewing Greek sentences if the whole set is like this… I could just downvote everything that’s not a complete sentence, but even those seem ill-fit for the purpose and, if we can’t recognize the source, there could be copyright problems with them.

Might some “quick and dirty” automated screening be possible by running the kind of language models used by DeepSpeech over the submitted sentences?

KenLM and similar tools work by giving you the probability of a sentence. If the model is big enough to be representative, it should estimate that probability reasonably accurately, and you could then pick a threshold below which sentences are either excluded or flagged to the user.

Typically they would give sentences with spelling mistakes much lower scores. There is some risk of legitimate sentences being hit, but perhaps at the margins that’s not such a big deal (since this is to produce a dataset for pronunciation, not to represent every phrase possible).
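To show the thresholding idea without any dependencies, here is a toy sketch. It uses a tiny add-one-smoothed unigram model written from scratch as a stand-in for a real language model such as KenLM; the class, the training sentences, and the threshold value are all made up for illustration, and a real model trained on a large corpus would behave far better.

```python
import math
from collections import Counter


class ToyUnigramModel:
    """Tiny stand-in for a real language model (NOT KenLM's API)."""

    def __init__(self, training_sentences):
        self.counts = Counter(
            word for sentence in training_sentences for word in sentence.lower().split()
        )
        self.total = sum(self.counts.values())
        self.vocab_size = len(self.counts) + 1  # +1 slot for unseen words

    def score(self, sentence):
        """Average log10 probability per word, with add-one smoothing."""
        words = sentence.lower().split()
        if not words:
            return float("-inf")
        log_prob = sum(
            math.log10((self.counts[word] + 1) / (self.total + self.vocab_size))
            for word in words
        )
        return log_prob / len(words)


def is_suspicious(model, sentence, threshold):
    """Flag a sentence whose average per-word score falls below the threshold."""
    return model.score(sentence) < threshold


# Made-up training data; a real model would need a large, representative corpus.
model = ToyUnigramModel([
    "we can hear the music",
    "the music is here",
    "we can see the film",
    "the film is good",
])
THRESHOLD = -1.0  # arbitrary value that happens to suit this toy corpus
```

With this toy data, `is_suspicious(model, "we can heer the film", THRESHOLD)` flags the misspelled sentence while the correctly spelled version passes. Averaging per word keeps long sentences from being penalised just for their length.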

These models tend to be quick to run (for the probabilities, not to create), so am hoping it could be tacked on without killing the user experience. And whilst I agree something sophisticated like @jf99 envisages would be great, with the resources / time available, “good enough” may be more practical :slight_smile:

Finally, I see some LMs also let you see the probabilities of individual words within a sentence, such as here:, so perhaps a heuristic could be used to filter (and this may provide a cheap way to flag to users the problem word(s) in a simple manner)
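A per-word heuristic along those lines might look like the sketch below. It uses a toy add-one-smoothed bigram model written for this example, not any particular library; the training sentences and the `min_probability` cut-off are assumptions. Because each word is scored in the context of the previous one, this kind of model has a chance of catching “here” used in place of “hear”, which a spelling check alone would miss.

```python
from collections import Counter


class ToyBigramModel:
    """Tiny add-one-smoothed bigram model, for illustration only."""

    def __init__(self, training_sentences):
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        vocab = set()
        for sentence in training_sentences:
            words = ["<s>"] + sentence.lower().split()
            self.context_counts.update(words[:-1])
            self.bigram_counts.update(zip(words, words[1:]))
            vocab.update(words[1:])
        self.vocab_size = len(vocab)

    def word_probabilities(self, sentence):
        """Return (word, P(word | previous word)) for each word in the sentence."""
        words = ["<s>"] + sentence.lower().split()
        result = []
        for prev, word in zip(words, words[1:]):
            numerator = self.bigram_counts[(prev, word)] + 1
            denominator = self.context_counts[prev] + self.vocab_size
            result.append((word, numerator / denominator))
        return result


def suspicious_words(model, sentence, min_probability):
    """Return the words whose conditional probability is below the cut-off."""
    return [
        word
        for word, probability in model.word_probabilities(sentence)
        if probability < min_probability
    ]


# Made-up training data and cut-off, chosen only to demonstrate the idea.
model = ToyBigramModel([
    "i can hear you",
    "i can hear the music",
    "come over here",
    "the music is over here",
    "you can hear the music",
])
print(suspicious_words(model, "i can here you", min_probability=0.09))
```

One caveat: the word right after a mistake often scores low too, since its context is now unusual, so the flagged word is a hint for the reviewer rather than a precise diagnosis.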