Questioning the +2 votes

Lately, I’ve been working across all datasets on statistics: health, diversity, etc. Whenever I included a matrix of up- vs down-votes in the system, the data got huge. On examination, I saw that quite a few languages (larger, active datasets) have very high vote counts on some recordings. E.g. this is for English:

As you can see, there are recordings with more than 1,000 votes…

The +2-vote validation system works well for small counts: 0-2, 1-3, 2-4… But if a recording reaches 100-102, that should point to a problem with the recording, even if it is finally validated.
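One reading of that system (my assumption from the numbers above, not the documented Common Voice rule) is “first side to lead by two wins.” A minimal sketch:

```python
def decision(up: int, down: int) -> str:
    """Decide a clip's fate under a hypothetical lead-by-2 rule."""
    if up - down >= 2:
        return "validated"
    if down - up >= 2:
        return "invalidated"
    return "undecided"  # keep serving the clip to listeners

# Small counts resolve quickly, but a contested clip can absorb
# hundreds of listen sessions before the rule finally decides:
print(decision(2, 0))      # validated
print(decision(1, 3))      # invalidated
print(decision(102, 100))  # validated, after 202 listen sessions
```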

I noticed this while examining Turkish recordings as I was writing moderation software. In v11.0, we suddenly got a 6-8 tie-breaker (validated) recording, where the speaker actually inserted an extra word into a long sentence; because the result is a common phrase, it is easy to miss. Even though I was listening carefully to find a mistake, I only caught it on my third listen.

So, if we have (in validated.tsv) 100 recordings with 100 up-votes each, there must be 98 down-votes on each of them - and those recordings are most probably not right.

That means 100x100 + 100x98 = 19,800 listen sessions, nearly all of them unnecessary (only the 2 decisive votes per recording were needed) - a big loss of volunteer time, and the result is probably wrong anyway.
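The arithmetic as a sketch, using the hypothetical 100/98 split from above:

```python
clips = 100
up, down = 100, 98                       # votes per clip under a lead-by-2 rule

total_listens = clips * (up + down)      # 100*100 + 100*98 = 19,800
minimum_needed = clips * 2               # the 2 decisive votes per clip
wasted = total_listens - minimum_needed

print(total_listens, wasted)  # 19800 19600
```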

The best solution I can think of is to stop feeding out recordings with down-votes > N (e.g. 5, 10, whatever) and flag them for professional review (e.g. by community core).

The second option is to move them to the bottom of the queue, so they are only listened to when the others run out (easily implemented with an SQL ORDER BY (up + down) ASC, so the least-voted clips come first).
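A sketch of that second option with sqlite3; the table name and columns are made up for illustration, and I use ascending order so fresh clips are served first and contested ones sink:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE clips (id INTEGER, up INTEGER, down INTEGER)")
db.executemany(
    "INSERT INTO clips VALUES (?, ?, ?)",
    [(1, 0, 0), (2, 100, 98), (3, 1, 1)],  # clip 2 is heavily contested
)

# Fresh clips first; the 198-vote clip sinks to the bottom of the queue.
queue = [row[0] for row in
         db.execute("SELECT id FROM clips ORDER BY up + down ASC")]
print(queue)  # [1, 3, 2]
```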

What do you think?

1 Like

I will add that many, many English sentences are misspelled or contain errors and still get passed in the Sentence Collector and recorded … and maybe validated, even when they are downright wrong.

I often reject recordings where the speaker auto-corrects the grammar. The text will say “He live in Georgia,” but the speaker fixes this as “He lives in Georgia,” which is obviously not kosher.

This is to add on the general point that the acceptance rates at every step of this project are way too high. I know, I know, access, inclusivity, yadda, yadda, yadda. But we need to be more discerning and pedantic. The machine needs to be useful, which won’t happen if it’s filled with downright mispronunciations (“debt” said as “debit”).

This isn’t about some ethnocentrism or white/bourgeois/America-first mentality. I can speak in my native NY accent or in a Trans-Atlantic/theater accent, and I can only get Google to recognize me if I “toccck in a varry Cahliforrrrrrnian style,” because the systems are all set to that training. Nothing takes me out of a spy thriller faster than a DC insider suddenly speaking like some surfer from Venice Beach. A machine trained to accept intelligible speech with an African speaker’s sentence inflection is good. A machine filled with a bunch of Mancunian speakers auto-correcting mistyped American idioms, or recorded hours of someone non-fluent saying “blood” as “brood,” will not help.

The whole project is currently way too biased towards tolerance.


I would love to see some statistics and empirical evaluation of this, e.g. what kinds of recordings are good for training modern general-purpose ASR models. As far as I know, little has been written on it.

Sentence Collector is not the only way to submit sentences; there is also the bulk-submission method, and it might be more prone to such problems: only a sample of the whole set is human-validated, and if the sample shows less than a 5% error rate, the batch is accepted.
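To illustrate why that sampling worries me (numbers entirely hypothetical): a bulk batch whose true error rate sits just under the 5% threshold will, on average, pass the sample check and still ship thousands of bad sentences.

```python
batch_size = 100_000     # hypothetical bulk submission
true_error_rate = 0.04   # under 5%, so a random sample usually passes

bad_sentences = int(batch_size * true_error_rate)
print(bad_sentences)  # 4000 erroneous sentences enter the corpus
```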

I always do…


You and me both. I have only seen academic papers that speak of inequities and racial consequences. I am all on board with acknowledging that facial-recognition software is attuned to light skin tones and Caucasian facial features - a bad thing that we must fix. When it comes to voices and dialects, I would point out in conferences or discussions that this isn’t racial bias so much as bias against “anyone who isn’t upper- or middle-class coastal Californian,” noting that Google et al. struggle with my voice and those of my friends (upstate NY, white) as much as with others, anecdotally.

Also, I said that including AAVE grammar and Indian English (phrases like “pre-pone,” which I honestly really like despite knowing they’re “non-standard”) as all part of one dialect of English opens the gates to then expecting systems to understand Bogan, Cockney, Irish Traveller, and everything else under the sun. If the system understands Valley Girl, FDR recordings, and working-class people from Perth, then why doesn’t it understand Newfoundlanders?

I haven’t gotten satisfying responses from researchers beyond their saying “that is a good point,” but then they return to framing every practical and philosophical problem in a very race-conscious but international-blind way. I get it. They want to focus on the verifiable parts of their studies.

I’m not well-versed in the relevant literature in this specific field, but to my knowledge the litmus test for most academics who specialize in both computer recognition and social justice is “can the system understand a male IT employee in the Bay Area and also a 19-year-old woman from Mobile, Alabama?”

This is actually a great opportunity to train an AI on context or to ask for clarification.

I meant “often there is a situation where I find myself forced to reject a sentence because the speaker corrected a plural/singular or conjugation or missing article mistake.” You may have interpreted my statement as “it happens that there is a wrong sentence recorded, and I sometimes vote ‘no’ on that sentence, but not always.”

In other words, I worded myself ambiguously. An analog could be the dangling participle (right term?) in “weary and shaken, our team and theirs decided to end the contest with a draw.” Who is weary and shaken? Was our team winning, but then became too tired, and called a draw? Were both teams weary and shaken?

In any case, I always reject recorded samples where the person inserts a word like “the” that wasn’t written down originally or corrects a word like “go” to “goes.”

Yeah, on reading it again I’ve got it :slight_smile: English is a foreign language to me, İstanbul dialect :confused:

Don’t feel bad for a second. I was ambiguous. It would trip up native speakers with a 50% chance. I didn’t even think you were non-native. Every language has its ambiguous things.

Korean, for example, has no differentiation between restrictive and non-restrictive modifiers. As a result, “I don’t like winters, those cold winters, that season of coldness (because I hate winters as a rule and love summers)” and “I don’t like cold winters (because I want to live in a place like Panama that doesn’t have a cold season)” both come out as “추운 겨울이 싫다.”

English has a lot of ambiguity in general. The comedy in the dialog in the movie Clue is almost entirely centered around double entendres and misunderstandings.

I downvoted many clips in the English section that were recorded and played back at double speed.

Also, some Spanish speakers are contributing many (!) Spanish clips in the English section.

I think the second batch you mention will not contribute to the above graph; unless it is done intentionally, those clips will be down-voted and gone.

The first one is more problematic: some people might validate them, others invalidate them.

But I think the main reason lies in the text corpus. Possibilities:

  • The reported sentences are not removed, so they keep being fed out for recording. People record them as written or as corrected (typos etc.). Either way can confuse the validators.
  • Foreign-word pronunciation can be another issue.
  • The text corpus can include spoken-language text (or local dialects), which people may see as wrong and down-vote.
  • etc

I looked at the reported statistics of the same dataset to see some correlation:

There are only 4,335 reports for 3,057 distinct sentences, mostly for grammar/spelling reasons - out of 981,444 unique sentences in validated. These numbers seem too low to me relative to the total population and the many bulk sentence PRs.
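The ratio behind my “too low” feeling, using the numbers above:

```python
reports = 4_335
distinct_reported = 3_057
validated_sentences = 981_444

# Fraction of validated sentences that anyone has bothered to report:
print(f"{distinct_reported / validated_sentences:.3%}")  # 0.311%
```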

Maybe people are not reporting because it needs a couple more clicks (a UX issue), and they keep recording/(in)validating instead.


As mentioned above:

Many native speakers “auto-correct” the sentences and let someone else do the reporting.
The next contributor is non-native and speaks the sentence “as is” (with the written error), also without reporting it - and in the worst case repeats this “learned written error” in other sentences, and so on…

Maybe giving 1 leaderboard point for reporting an error in Speak/Listen (after confirmation) could motivate more contributors to report wrong sentences.

I reported many sentences in Speak/Listen with:

  • Missing full stops at the end of a sentence.
  • “Mr.”/“Mr”, “Dr.”/“Dr”, “vs.”/“vs” inconsistencies, and so on.
  • Sentences starting with lower case.
  • Cut-off (incomplete) sentences that make no sense to record.

Or a bug-hunter contest could be a booster for reporting sentences: whoever reports the most mistakes in sentences within a selected period gets a prize.


Missing full stops at the end of a sentence.
Sentences starting with lower case

I think these are OK as they do not affect the acoustic model.

Or a bug hunter contest

You would win that :slight_smile:


Thanks for the clarification.

Then a contest without “robovoice” - no signing up from my side…

Also, your proposal(s) on GitHub would improve validation a lot :smiley:
