Grammatically poor sample sentences

I'm specifically talking about issues like "here" in place of "hear", which I have seen in the text.

Got it, well I don't know about this one.

We now have the Sentence Collector, which provides a way to submit new sentences and have them validated. However, it has only been live for the past few weeks, and many of the sentences pre-date it. In fact, English currently has a big backlog, so you are reviewing sentences recorded around a year ago.

I believe the eventual plan is to resubmit the existing sentences through the Sentence Collector so they can be validated and fixed as needed.

We want to run an automated clean-up on the existing sentences. We haven't yet established a final process to solve the issue described here; the current one is to request the removal of these sentences and then re-submit corrected versions through the Sentence Collector tool.

Any ideas to improve this process are welcome! :slight_smile:

How many are we talking about? A good proof-reader should be able to check and edit at quite a speed; much more quickly than it takes to speak them.

'Hear' and 'here' might be a distracting error in print, but would not necessarily affect the spoken words. Leonardo Di Caprio 'staring' in a film would, however.

1 Like

In my area, there's a big difference between 'hear' (heer) and 'here' (hee-ya).

Even if it doesn't make a difference to pronunciation, sometimes speakers pause or stumble when they encounter a mistake in the text.

2 Likes

We need a tool that lets you

  • search all submitted sentences of a specified language
  • submit a corrected version of a sentence
  • provide a justification for the correction (e.g. a link to a dictionary)
  • review the corrections that others have made (ideally allowing a discussion between corrector and reviewer)
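As a sketch, the record such a tool would need to track per correction might look like this (field names and the example values are hypothetical, just to make the feature list concrete):

```python
from dataclasses import dataclass, field

@dataclass
class Correction:
    """One proposed fix to a submitted sentence (hypothetical schema)."""
    language: str       # e.g. "de"
    original: str       # sentence as currently in the corpus
    corrected: str      # proposed replacement
    justification: str  # e.g. a dictionary link backing the change
    reviews: list = field(default_factory=list)  # reviewer comments / votes

# Example usage with made-up data:
fix = Correction(
    language="de",
    original="Das ist ein Fehler hier.",
    corrected="Das ist ein Fehler, hier.",
    justification="https://www.duden.de/",
)
```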

I have already collected dozens of mistakes in the German corpus. The longer we wait, the more of them make it into the dataset.

1 Like

I don't expect we will be able to build the tool you describe in the short term (we have other priorities and not a lot of resources). That's why the current proposal, to at least ensure no bad sentences end up in the dataset, is what I described:

  1. Request removal from the sentences list.
  2. Correct the sentences and submit them to the Sentence Collector so they end up on the site in their correct form.

Existing sentences that can be identified automatically as "wrong" will be removed by our cleaning scripts.
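As an illustration, a screening script of that kind might look like the sketch below. The rules here are hypothetical; the actual criteria the cleaning scripts use are not described in this thread.

```python
import re

# Hypothetical screening rules -- the real cleanup scripts' criteria
# are not specified in this thread.
MAX_WORDS = 14  # long sentences are harder to read aloud in one take
ALLOWED = re.compile(r"^[A-Za-z ,.'\"?!-]+$")  # letters plus basic punctuation

def is_suspicious(sentence: str) -> bool:
    """Return True if a sentence should be removed or sent for manual review."""
    words = sentence.split()
    if not words or len(words) > MAX_WORDS:
        return True
    if not ALLOWED.match(sentence):
        return True  # digits, URLs, markup, stray symbols, etc.
    if not sentence[0].isupper() or sentence[-1] not in ".?!":
        return True  # likely a fragment rather than a full sentence
    return False

print(is_suspicious("Where did you hear that?"))  # False: clean, complete
print(is_suspicious("visit http://example.com"))  # True: URL, lowercase start
```

Note that rules like these only catch structural problems; a "here"/"hear" swap is invisible to them, which is why the manual removal-and-resubmit route is still needed.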

In any case, your feedback is valuable and we can incorporate it into the list of things we would like to have in the future.

1 Like

The lack of resources is understandable. Where and how do I request the removal of several sentences? Shall I make a PR on GitHub?

Yes please, thanks for your understanding :slight_smile:

Suggestion: since the volunteer readers will all be seeing these sentences and (we hope) inspecting them attentively before reading them aloud, could we, in addition to the 'Skip' button, provide a button for 'Error in this sentence?' That would act as a filter to collect as many as possible, narrowing the task for the second-level reviewers. It won't be infallible, but it could be a big (and inexpensive) step in the right direction.

4 Likes

Incidentally, you speak of needing to cite a reference to justify corrections. That sounds needlessly complex to me, and would be very difficult in the case of punctuation and some of the grammar errors. You'll find plenty of people who can identify that 'should' needs to be replaced with 'will', but far fewer who will be able to identify it as an inappropriate use of the future conditional subjunctive (for example). Why do you need that anyway?

Couldn't it just be a system where people vote on whether they agree with the correction? That's much simpler.

I suggest we don't go too deep into solution ideation, since at the end of the day it's not something we will be able to change today, and it tends to become an endless conversation about personal preferences.

I think it's better to focus on describing the problem clearly, so we can come back here for reference when we have time to start thinking about a proper solution :slight_smile:

2 Likes

Here's a question: at what point is it no longer useful to get additional recordings of a sentence? I know DeepSpeech only uses one recording of a sentence. Even if someone else can make use of additional copies, there must surely be a point of diminishing returns.

So essentially my question is: is it worth even bothering to import the old sentences that have existed for years and have been recorded countless times already? Maybe they should just be retired once the Sentence Collector has amassed a certain number of new sentences.

Here's the situation I encountered while reviewing Greek sentences: it seems that the ones in there are fragments from a number of books. I couldn't recognize the source of the ones I got just now ("Ας πούμε το άλφα είναι το πιο τυχερό", "Μπορεί να ζευγαρώσει με ένα σωρό άλλα γραμματάκια", "και σύμφωνα και φωνήεντα", roughly: "Let's say the alpha is the luckiest", "It can pair up with a heap of other little letters", "both consonants and vowels"), but it's obvious that it's a single sentence from some existing text, cut into three pieces at punctuation boundaries. This produces pieces of grammatically correct but incomplete language, and certainly not sentences that would appear in a regular speech corpus.

I found the source of a couple of other sentences I got: it's the book "Παραμύθι χωρίς όνομα" ("A Tale Without a Name"), which seems to be in the public domain (the author died in 1941), but it suffers from the same problem.

I'm not sure it makes sense to continue reviewing Greek sentences if the whole set is like this… I could just downvote everything that's not a complete sentence, but even the remaining ones seem ill-suited to the purpose, and if we can't recognize the source, there could be copyright problems with them.
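The cut-at-punctuation pattern described above could at least be flagged automatically. A rough heuristic might look like this (illustrative only, with a tiny hand-picked conjunction list standing in for a real one):

```python
# Illustrative heuristic for spotting fragments cut out of longer
# sentences at punctuation boundaries: they tend to start lowercase,
# start with a conjunction, or lack sentence-final punctuation.
CONJUNCTIONS = {"and", "but", "or", "και", "αλλά", "ή"}  # tiny sample list

def looks_like_fragment(sentence: str) -> bool:
    words = sentence.split()
    if not words:
        return True
    if words[0].lower() in CONJUNCTIONS:
        return True  # e.g. "και σύμφωνα και φωνήεντα"
    if words[0][0].islower():
        return True  # mid-sentence capitalization
    if sentence[-1] not in ".;!?":
        return True  # no terminal punctuation (';' is the Greek question mark)
    return False

print(looks_like_fragment("και σύμφωνα και φωνήεντα"))   # True
print(looks_like_fragment("This is a complete sentence."))  # False
```

This would not catch a fragment that happens to start with a capital and end in a full stop, but it would cheaply surface many of the pieces described above for human review.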

Might some "quick and dirty" automated screening of submitted sentences be possible using the kind of language models used by DeepSpeech?

KenLM and similar tools work by giving you the probability of a sentence, and if the model is big enough to be representative it should estimate that probability reasonably accurately; in turn, you might find a threshold below which sentences are either excluded or flagged to the user.

Typically they would give sentences with spelling mistakes much lower scores. There is some risk of legitimate sentences being hit, but perhaps at the margins that's not such a big deal (since this is to produce a dataset for pronunciation, not to represent every possible phrase).

These models tend to be quick to run (for the probabilities, not to create), so I'm hoping it could be tacked on without killing the user experience. And whilst I agree that something sophisticated like @jf99 envisages would be great, with the resources and time available, "good enough" may be more practical :slight_smile:

Finally, I see some LMs also let you see the probabilities of individual words within a sentence, as here: https://colinmorris.github.io/lm-sentences/#/, so perhaps a heuristic could be used to filter (and this may provide a cheap way to flag the problem word(s) to users in a simple manner).
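To make the thresholding idea concrete, here is a toy sketch. A real setup would load a trained KenLM model instead of the tiny in-process bigram model built here (and the threshold value below is picked for this toy data), but the scoring and flagging logic has the same shape:

```python
import math
from collections import Counter

# Stand-in training text; a real KenLM model is trained on a large corpus.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a cat and a dog",
    "the cat and the dog sat",
]

# Count unigrams and bigrams, with sentence-boundary markers.
unigrams, bigrams = Counter(), Counter()
for line in corpus:
    tokens = ["<s>"] + line.split() + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

VOCAB = len(unigrams)

def word_logprobs(sentence):
    """Add-one-smoothed bigram log10 probability of each word in context."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return [
        (word, math.log10((bigrams[(prev, word)] + 1) / (unigrams[prev] + VOCAB)))
        for prev, word in zip(tokens, tokens[1:])
    ]

def score(sentence):
    """Average per-word log10 probability, comparable across sentence lengths."""
    scored = word_logprobs(sentence)
    return sum(lp for _, lp in scored) / len(scored)

THRESHOLD = -0.8  # cutoff chosen for this toy model; tuning it is the hard part

def flagged(sentence):
    return score(sentence) < THRESHOLD

# A misspelling ("zat") drags the score below the cutoff, and the per-word
# scores point at the offending word.
print(score("the cat sat on the mat") > score("the cat zat on the mat"))  # True
worst = min(word_logprobs("the cat zat on the mat"), key=lambda t: t[1])[0]
print(worst)  # zat
```

The per-word minimum is the cheap "flag the problem word" heuristic mentioned above: the lowest-probability word in a low-scoring sentence is a reasonable first guess at where the error is.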

As we have just announced today, in the coming weeks we will ask for help to improve how we extract valid sentences from large sources of data.

Anything to improve our algorithms is more than welcome :slight_smile:

Cheers