Intelligibility remediation

Bernstein et al. (1990), “Automatic evaluation and training in English pronunciation,” in the Proceedings of the First International Conference on Spoken Language Processing (ICSLP 1990, Kobe, Japan). The fundamental, thirty-year-old error occurs in this passage from section 2.2.2, “Intelligibility,” at the top left of the third page:

Listeners differ considerably in their ability to predict unintelligible words, and this contributes to the imprecision of the results. Thus, it seems that the quality rating is a more desirable number to correlate with the automatic-grading score for sentences.

To show how pervasive the mistake is, pages 8-10 of this thesis from 2016 survey all the major ways people have avoided trying to predict authentic intelligibility: almost all of them use the posterior probability of the HMM recognition results (Bernstein et al.’s “automatic-grading score”), and none of them try to predict authentic intelligibility from training data that measures actual listeners’ comprehension of spoken utterances. This is why a YouTube search for “Rosetta Stone fail” today will show you dozens of people so upset with the pronunciation assessment performance of the largest commercial software offerings that they felt compelled to publish a video demonstrating the problem. It is also why Educational Testing Service said last year that their latest SpeechRater 5.0 was only 58% accurate. It is astonishing to me that we have reached the point where people with relatively heavy accents have been able to use full-fledged dictation on mobile devices for years, with word error rates well under 5%, yet ETS can’t score pronunciation much better than a coin flip!
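
To make concrete what “predicting authentic intelligibility from actual listeners’ comprehension” means, here is a minimal sketch, not taken from any of the cited papers or systems: several listeners transcribe what they heard, each transcript is aligned against the intended prompt, and the mean fraction of prompt words recovered becomes the intelligibility label a model would be trained to predict. All names and data below are hypothetical.

```python
# Minimal sketch of a listener-derived intelligibility label (hypothetical data).
from difflib import SequenceMatcher


def words_recovered(prompt: str, transcript: str) -> float:
    """Fraction of prompt words a listener's transcript reproduces,
    using a simple longest-common-subsequence word alignment."""
    prompt_words = prompt.lower().split()
    heard_words = transcript.lower().split()
    matcher = SequenceMatcher(a=prompt_words, b=heard_words)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(prompt_words) if prompt_words else 0.0


def intelligibility_score(prompt: str, listener_transcripts: list[str]) -> float:
    """Mean word-recovery rate across listeners: the 'authentic intelligibility'
    target, as opposed to an ASR posterior-based score."""
    if not listener_transcripts:
        return 0.0
    return sum(words_recovered(prompt, t) for t in listener_transcripts) / len(listener_transcripts)


# Hypothetical example: three listeners transcribe one learner utterance.
prompt = "the weather was lovely on saturday"
transcripts = [
    "the weather was lovely on saturday",   # everything understood
    "the weather was lonely on saturday",   # one word misheard
    "the water was lovely saturday",        # two words lost
]
print(round(intelligibility_score(prompt, transcripts), 3))  # -> 0.833
```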

There are two Nakagawa references from 2011 that describe the intelligibility method, but this newer paper, which cites one of them, is much easier to read and understand: Kibishi et al. (2015), “A statistical method of evaluating the pronunciation proficiency/intelligibility of English presentations by Japanese speakers,” ReCALL, 27(1), 58-83. The technique of actually trying to predict authentic intelligibility from listeners’ transcripts has so far been implemented in only one commercial system, which is free as in beer but not entirely open source (the client-server code is; the mobile apps distributed by 17zuoye.com are not) and is used by 33 million K-6 ESL students in every province in China.
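
The overall shape of such a statistical method can be sketched as follows: fit a regression from automatically measurable features of an utterance to the listener-derived intelligibility label, and evaluate by correlation with the human score. This is only a hedged illustration of the general idea; the features and data below are hypothetical, and Kibishi et al. use their own feature set and corpus of Japanese learners.

```python
# Hedged sketch: regress listener-derived intelligibility on automatic features.
import numpy as np

# Each row: hypothetical features for one utterance, e.g.
# [ASR posterior-based pronunciation score, speaking rate, pause ratio].
X = np.array([
    [0.92, 3.8, 0.05],
    [0.71, 2.9, 0.18],
    [0.55, 2.1, 0.31],
    [0.83, 3.4, 0.09],
    [0.40, 1.8, 0.42],
])

# Targets: intelligibility labels derived from listeners' transcripts
# (e.g. the mean word-recovery rate from the previous sketch).
y = np.array([0.97, 0.80, 0.58, 0.90, 0.41])

# Ordinary least-squares fit with an intercept term.
design = np.hstack([X, np.ones((X.shape[0], 1))])
coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
predictions = design @ coefficients

# Evaluate the way the literature usually does: correlation between the
# automatic prediction and the human, listener-based score.
r = np.corrcoef(predictions, y)[0, 1]
print(f"correlation with listener-based intelligibility: {r:.2f}")
```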

But don’t take my word for it. Look at the complaints about the problem in the speech-language pathology and second-language teaching literature. E.g., O’Brien et al. (2019), “Directions for the future of technology in pronunciation research and teaching,” Journal of Second Language Pronunciation, 4(2), 182-207, which says on page 186: “pronunciation researchers are primarily interested in improving L2 learners’ intelligibility and comprehensibility, but they have not yet collected sufficient amounts of representative and reliable data (speech recordings with corresponding annotations and judgments) indicating which errors affect these speech dimensions and which do not.” Their discussion starting on page 192, “Collecting data through crowdsourcing,” is also highly pertinent to Common Voice’s approach.
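
As a purely illustrative sketch of the kind of record that call for data implies (a speech recording paired with listener transcriptions and judgments, collected Common Voice-style from crowd contributors), something like the following could serve. The field names are my assumptions, not an existing Common Voice or 17zuoye schema.

```python
# Hypothetical schema for a crowdsourced intelligibility record.
from dataclasses import dataclass, field


@dataclass
class ListenerJudgment:
    listener_id: str
    transcript: str            # what this listener heard, verbatim
    comprehensibility: int     # e.g. a 1-9 rating of effort needed to understand


@dataclass
class IntelligibilityRecord:
    clip_path: str             # audio file contributed by a learner
    prompt_text: str           # the sentence the learner was asked to read
    judgments: list[ListenerJudgment] = field(default_factory=list)


record = IntelligibilityRecord(
    clip_path="clips/learner_0001.mp3",
    prompt_text="the weather was lovely on saturday",
    judgments=[
        ListenerJudgment("listener_a", "the weather was lovely on saturday", 8),
        ListenerJudgment("listener_b", "the water was lovely saturday", 5),
    ],
)
```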

One last point on why intelligibility is superior to judging learners by their accent or prosodic features such as stress. The Common European Framework of Reference (CEFR) scale for phonological control clearly shows that intelligibility is an earlier requirement (B1) than accent and intonation (B2) or stress (C1).


I am eager to address whatever other questions there may be. In turn, I’d like to ask the community for suggestions as to who might want to support this effort independently, should it be deemed inappropriate for Mozilla.