Intelligibility remediation

Then we’d have an entire other set of problems. Spelling errors, spelling variations, punctuation problems… all in 100+ languages. I’d say this cure is worse than the disease.

Kelly, those are valid concerns, but they have been addressed. As you mentioned in your April 4, 2018 podcast with Dustin Driver, people generally train speech recognition systems from transcripts, not merely from booleans over entire utterances. It is easy to show that even without any spelling correction or homophone aliasing, you get orders of magnitude more information out of typed transcripts than out of a single bit for an entire multi-word utterance. Spelling correction can be addressed with soundex, or with more elaborate systems such as https://arxiv.org/abs/cmp-lg/9702003 . Addressing homophones is a more substantial problem (a transcriptionist making spelling errors at least discloses the poor quality of their knowledge, their work, or both), but homophones can be identified with a phonetic dictionary like CMUDICT, which has been refined for almost half a century now. As for punctuation, we generally disregard it. I would be happy to share the transcription analysis scripts I’ve been using with Amazon Mechanical Turk.
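For concreteness, here is a minimal sketch of the homophone-aliasing step. It is not my actual Mechanical Turk analysis script; it assumes nltk is installed and the cmudict corpus has been downloaded:

```python
# Minimal sketch: forgive homophone substitutions in typed transcripts by
# checking whether the prompt word and the typed word share a CMUDICT
# pronunciation. Requires: pip install nltk; nltk.download('cmudict').
from nltk.corpus import cmudict

PRONUNCIATIONS = cmudict.dict()  # word -> list of ARPAbet pronunciations

def are_homophones(word_a: str, word_b: str) -> bool:
    """True if the two words share at least one CMUDICT pronunciation."""
    a = PRONUNCIATIONS.get(word_a.lower(), [])
    b = PRONUNCIATIONS.get(word_b.lower(), [])
    return any(tuple(p) == tuple(q) for p in a for q in b)

def transcript_matches(prompt: str, transcript: str) -> bool:
    """Word-by-word comparison that treats homophones as correct."""
    p_words, t_words = prompt.lower().split(), transcript.lower().split()
    if len(p_words) != len(t_words):
        return False
    return all(p == t or are_homophones(p, t) for p, t in zip(p_words, t_words))

print(transcript_matches("their house", "there house"))  # True
print(transcript_matches("their house", "the house"))    # False
```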

I’m also glad to contribute my 84,000 Turker transcripts: about four each for thirty different marginal (often subpar, all near-subpar) recordings of K-6 ESL learners in Chinese state schools attempting each of 700 of the most commonly spoken English words and phrases. You are most welcome to measure that data in terms of the quantity of useful information relative to one bit per utterance, denominated by the volunteer crowdworker effort of a thumbs-up/down versus typing a transcript.
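To make the “orders of magnitude” point concrete, here is a rough back-of-envelope; the vocabulary size and utterance length are assumptions for illustration, not measurements from that data:

```python
# Rough, illustrative comparison of the information content of a typed
# transcript versus a single accept/reject bit. Assumed numbers only.
import math

vocabulary_size = 5000        # assumed candidate words per position
words_per_utterance = 5       # assumed average prompt length

bits_from_thumbs = 1.0
bits_from_transcript = words_per_utterance * math.log2(vocabulary_size)

print(f"thumbs-up/down:   {bits_from_thumbs:.0f} bit")
print(f"typed transcript: {bits_from_transcript:.0f} bits (upper bound)")
# ~61 bits versus 1 bit under these assumptions; the bound is loose because
# transcripts of a known prompt are highly redundant, but the gap is large.
```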

I would also be happy to start with English alone, which is far and away the language in greatest demand for second language learning, as an experimental starting point. Please let me know your thoughts.


Do there exist soundex and/or cmp-lg/9702003 implementations that are MPL-compatible for all 104 languages of Common Voice? (Also, soundex alone can’t correct spelling. There would need to be a one-to-many soundex “inverse” and a language model to select the proper element of the inverse image; see the sketch below.)
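A minimal illustrative sketch of such an inverse, with a unigram frequency table standing in for the language model (the soundex implementation and the word list here are made up for illustration, not an existing MPL-licensed package):

```python
# Illustrative sketch: a one-to-many soundex "inverse" plus a unigram
# frequency table standing in for the language model that picks a candidate.
from collections import defaultdict

def soundex(word: str) -> str:
    """Classic American Soundex: first letter plus up to three digit codes."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    digits, prev = [], codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":            # h and w do not separate identical codes
            prev = code
    return (word[0].upper() + "".join(digits) + "000")[:4]

def build_inverse(vocabulary):
    """Map each soundex code back to the (many) vocabulary words sharing it."""
    inverse = defaultdict(set)
    for w in vocabulary:
        inverse[soundex(w)].add(w)
    return inverse

def correct(word, inverse, unigram_counts):
    """Choose the most frequent in-vocabulary word with the same code."""
    candidates = inverse.get(soundex(word), set())
    return max(candidates, key=lambda c: unigram_counts.get(c, 0)) if candidates else word

vocab = ["their", "there", "school", "skull"]
counts = {"their": 900, "there": 1200, "school": 300, "skull": 40}
print(correct("skool", inverse=build_inverse(vocab), unigram_counts=counts))  # school
```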

Do there exist MPL-compatible phonetic dictionaries for all 104 languages of Common Voice? (Also, a homophone malapropism could not be fixed by a phonetic dictionary, but it could be fixed by a language model.)

Punctuation is disregarded? Why? Current TTS uses of Common Voice employ punctuation to induce the proper pauses for commas and periods and also intonation changes for questions.

I agree that the binary up/down system discards data that would otherwise be usable with transcript corrections, but my main concern would be scalability. The goal is 10,000 hours of data, which means literally millions of clips. If a clip ordinarily takes, say, 4 seconds to review and now takes 12 seconds, the 10,000-hour goal will take three times as long to achieve.
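As a rough illustration of that scaling, assuming (purely for illustration) an average clip length of about 4 seconds:

```python
# Back-of-envelope reviewer effort for the 10,000-hour goal.
# The average clip length is an assumption, not an official Common Voice figure.
GOAL_HOURS = 10_000
AVG_CLIP_SECONDS = 4
clips = GOAL_HOURS * 3600 / AVG_CLIP_SECONDS          # ~9 million clips

for review_seconds in (4, 12):                        # thumbs vs. typed transcript
    reviewer_hours = clips * review_seconds / 3600
    print(f"{review_seconds}s per review -> {reviewer_hours:,.0f} reviewer-hours")
# 4s -> ~10,000 reviewer-hours; 12s -> ~30,000 reviewer-hours: triple the effort.
```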


Kelly, I’ve looked over the IPA in Wiktionary. It does not have complete coverage for the less prevalent languages but it can be used for the words where it exists.
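For example, here is a rough sketch of how per-word IPA could be pulled from Wiktionary where it exists; the raw-wikitext URL and the {{IPA|...}} template format are my assumptions about current Wiktionary conventions, so treat this as illustrative:

```python
# Illustrative check for whether English Wiktionary has IPA for a word.
import re
import requests

def wiktionary_ipa(word: str, lang_code: str = "en"):
    """Return any IPA strings found in {{IPA|<lang>|...}} templates on the page."""
    url = "https://en.wiktionary.org/w/index.php"
    text = requests.get(url, params={"title": word, "action": "raw"}).text
    matches = re.findall(r"\{\{IPA\|([^|}]+)\|([^|}]+)", text)
    return [ipa for code, ipa in matches if code == lang_code]

print(wiktionary_ipa("water"))   # non-empty if coverage exists, [] otherwise
```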

Should I propose this as an English only experiment to measure the resulting data collection rate and utility relative to the single bit approach?

I also figured out that it helps to close my eyes while listening to the recording and only then read the sentence, to check whether it’s exactly what I heard.

Maybe we can fade in the text after the recording has finished? (minus the silence at the end)


Could you please include actual citations of Nakagawa and Bernstein?

Bernstein et al. (1990) “Automatic evaluation and training in English pronunciation,” in the First International Conference on Spoken Language Processing (ICSLP 1990, Kobe, Japan). The fundamental thirty-year-old error occurs in this part of section 2.2.2, “Intelligibility,” at the top left of the third page:

Listeners differ considerably in their ability to predict unintelligible words, and this contributes to the imprecision of the results. Thus, it seems that the quality rating is a more desirable number to correlate with the automatic-grading score for sentences.

To show how pervasive the mistake is, pages 8-10 of this thesis from 2016 survey all the major ways people have avoided trying to predict authentic intelligibility: almost all of them use the posterior probability of the HMM recognition results (Bernstein et al.’s “automatic-grading score”), and none of them try to predict authentic intelligibility from training data measuring actual listeners’ comprehension of spoken utterances. This is why a YouTube search on “Rosetta Stone fail” today will show you dozens of people so upset with the pronunciation assessment performance of the largest commercial software offerings that they feel they have to publish a video demonstrating the problem. It is also why Educational Testing Service said last year that their latest SpeechRater 5.0 was only 58% accurate. It is astonishing to me that we have come to the point where people with relatively heavy accents have been able to use full-fledged dictation on mobile devices for years, with word error rates well under 5%, but ETS can’t score pronunciation much better than a coin flip!

There are two Nakagawa references with the intelligibility method from 2011, but this newer paper, which cites one of them, is much easier to read and understand: Kibishi et al. (2015) “A statistical method of evaluating the pronunciation proficiency/intelligibility of English presentations by Japanese speakers,” ReCALL, 27(1), 58-83. The technique of actually trying to predict authentic intelligibility from listeners’ transcripts has so far been implemented in only one commercial system, which is free as in beer but not entirely open source (the client-server code is; the mobile apps distributed by 17zuoye.com are not) and is used by 33 million K-6 ESL students in every province in China.

But don’t take my word for it. Look at the complaints about the problem in the speech-language pathology and second language teaching literature, e.g., O’Brien et al. (2019) “Directions for the future of technology in pronunciation research and teaching,” Journal of Second Language Pronunciation, 4(2), 182-207, which says on page 186 that “pronunciation researchers are primarily interested in improving L2 learners’ intelligibility and comprehensibility, but they have not yet collected sufficient amounts of representative and reliable data (speech recordings with corresponding annotations and judgments) indicating which errors affect these speech dimensions and which do not.” Their discussion starting on page 192, “Collecting data through crowdsourcing,” is also highly pertinent to Common Voice’s approach.

One last point on why intelligibility is superior to judging learners by their accent or prosodic features such as stress. The Common European Framework of Reference (CEFR) uses the following levels for phonological control, clearly showing that intelligibility is an earlier requirement (B1) than accent and intonation (B2) or stress (C1):


I am eager to address whatever other questions there may be. In turn, I’d like to ask the community for suggestions as to who might want to support this effort independently, should it be deemed inappropriate for Mozilla.

Somehow this reminds me of Chomsky’s funny comments on the Turing Test. Bernstein’s claim appears sensible: your ability to sing and whether the audience can figure out what song you’re singing may be two different (though overlapping) areas of investigation.

@kdavis I should have cited pp. 7-9, not 8-10, of Kyriakopoulos’s 2016 dissertation as where the error recurs. Anyway, if you agree this is a nice-to-have, then please follow https://phabricator.wikimedia.org/T166929 for updates.

@kdavis @nukeador I am hoping the Wikimedia Foundation will pay to release the 0.6 GB database mentioned at https://phabricator.wikimedia.org/T166929#5473028 under CC-BY-SA.

@kdavis Have I addressed all of your outstanding concerns? @nukeador if I have, I would like to schedule a talk to prepare a management proposal.

@jsalsman this is not my area of expertise, so I’ll defer to Kelly and his team to comment on this one.

Please see “Can DB.vote be a boolean union with a utf-8 string?”
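As a rough sketch of what that could mean in practice (a hypothetical table for illustration, not Common Voice’s actual schema):

```python
# Hypothetical illustration: keep the existing up/down bit, but allow an
# optional typed transcript alongside it, so a "vote" is effectively a
# boolean-or-UTF-8-string. This is not Common Voice's actual schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE votes (
        clip_id     INTEGER NOT NULL,
        user_id     INTEGER NOT NULL,
        is_valid    INTEGER,   -- existing thumbs-up/down bit (nullable)
        transcript  TEXT,      -- optional UTF-8 transcript of what was heard
        CHECK (is_valid IS NOT NULL OR transcript IS NOT NULL)
    )
""")
# A reviewer can still cast a plain up/down vote...
conn.execute("INSERT INTO votes VALUES (1, 42, 1, NULL)")
# ...or type what they actually heard instead.
conn.execute("INSERT INTO votes VALUES (1, 43, NULL, 'the quick brown fox')")
```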

For context, I’m also putting my email reply to James here about this proposal:

First of all, thank you so much for your interest and time to improve Common Voice, we really appreciate it.

I’ve had a conversation with the Deep Speech engineers and they have evaluated your proposal. After analysis, and taking into consideration the roadmap for 2020 and our priorities, a decision has been made not to pursue your proposal at this point.

This is definitely something interesting, and we’ll keep it in mind in case we want to consider it in the future.

Thanks for your understanding; as I commented, we really appreciate feedback and input from everyone.

Cheers.

@Nathan @asa @reuben would you like to replace Microsoft with DeepSpeech in the Stanford Almond OVAL? https://community.almond.stanford.edu/t/pulseaudio-event-sequence/153/2

I’m not sure whether @nukeador is referring to my proposal for collecting transcripts at Common Voice or for replacing Microsoft speech recognition with DeepSpeech. I am happy to do as much or as little of either from afar if the Foundation will not, but I am sure both are strongly in your interest, and I would like to appeal any decision to leave either of them entirely to me, this year and next.

Accordingly, I am now asking that you devote at least 500 FTE hours to both integrating DeepSpeech 0.6 with the Stanford Almond OVAL and collecting transcripts instead of thumbs-up/down at Common Voice. I am sure both are excellent investments of your time. Please let me know your thoughts.

I was referring to the proposal for collecting transcripts. This is something that is outside our 2020 roadmap.

Thanks for your understanding.

@nukeador I will ask you to reconsider after I’ve submitted a pull request. I do not expect the design review to take more than a few hours, or the database change review to take more than ten at most.

@jsalsman please understand that this project implements features based on a roadmap; if something is not approved, it won’t be implemented.

We appreciate your feedback and interest but this is not something we will consider for the product roadmap in 2020, this is a programmatic decision based on our current focus.

Thanks for your understanding.

@nukeador Where is the roadmap published? Would you rather I fork than submit a patch?