Hi! I’m Jim Salsman, and I’ve been working on pronunciation assessment for language learning and intelligibility remediation since 1996. I’ve been focusing on an issue introduced by Bernstein et al. in their 1989 paper, which causes a bug that still exists in Pearson, Rosetta Stone, Duolingo, Berlitz, Educational Testing Service, and almost all other pronunciation assessment software today. It wasn’t corrected until the intelligibility assessment work of Professor Seiichi Nakagawa and his graduate students in 2011, and that correction appears in only one commercial pronunciation remediation product today. I’ve proposed working on this in the IEEE Learning Technology Standards Committee and the Wikimedia Foundation, and I believe the best way to do so is to produce a pull request for Mozilla Common Voice, changing the binary thumbs-up-good/thumbs-down-bad feedback for likely exemplary pronunciations into transcription requests. If you are interested in this topic, please join the Spoken Language Interest Group at http://bit.ly/slig
(direct link: https://docs.google.com/forms/d/e/1FAIpQLSfzcN7gPHh9iIXwgASNNsTtwLXmsRa6h10KGKM-iNgDwuN3Aw/viewform )
Hi Jim,
I’m not a linguistics expert and I’m having some trouble understanding what you are proposing here.
Can you help us better understand whether you have a specific request about how we handle accents at Common Voice? Can you illustrate with an example of what would change and how it would affect our strategy or site?
Thanks and welcome to the community!
Sure, Rubén. Instead of asking people for a binary thumbs-up-good/thumbs-down-bad response to utterances which you believe people were trying to pronounce correctly, you could be asking for transcriptions of subpar pronunciations, which produces more information about the intelligibility of each utterance.
I wrote a paper about why this is important at https://arxiv.org/abs/1709.01713 and you can see confirmation of the need from the first reference at the top of http://bit.ly/slig
In short, everyone, whether they are a native speaker or not, is already quite proficient at comprehending accented speech. If anyone in the English-speaking world hears the words “Las Wegas,” they know it is formally wrong, but even more so, they know it is a city in Nevada.
All of the pronunciation remediation systems available commercially (some of which are used in incredibly high-stakes situations, such as immigrating to the UK or obtaining permanent residency in Australia) depend on the Bernstein et al. (1989) method of judging speech in relation to a specific “best” accent, which, for example, keeps Irish speakers from qualifying for Australian residency unless they take a special class. If you have a few minutes, search for “Rosetta Stone fail” on YouTube, and see why the Educational Testing Service says the state of the art in pronunciation assessment is only about 58% accurate, barely better than flipping a coin.
By trying to predict authentic intelligibility instead of fluency relative to a specific “best” accent, we can overcome this, and we can do it for free, using client-side speech recognition such as the PocketSphinx.js system described in the paper above.
I believe this can produce second language pronunciation tutoring for free in multiple languages for everyone, and I’m committed to making it happen. Please let me know your thoughts.
So basically, instead of having people read the text, you would hide the text and ask them to write down what they heard, right?
I remember @rosana mentioned something similar as a good practice when validating sentences, because you tend to assume that what is written is what you are hearing.
@kdavis any thoughts on this? Would this data be more useful for training our models?
Then we’d have an entire other set of problems. Spelling errors, spelling variations, punctuation problems… all in 100+ languages. I’d say this cure is worse than the disease.
Kelly, those are valid concerns, but have been addressed. As you mentioned in your April 4, 2018 podcast with Dustin Driver, people generally train speech recognition systems from transcripts, not merely booleans over entire utterances. It is easy to show that even without any spelling correction or homophone aliasing, you get orders of magnitude more information out of typed transcripts than a single bit for an entire multi-word utterance. Spelling correction can be addressed with soundex, or more elaborate systems as in https://arxiv.org/abs/cmp-lg/9702003 . Addressing homophones is a more substantial problem, because a transcriptionist making spelling errors is disclosing the poor quality of their knowledge, work, or both, but homophones can be identified with a phonetic dictionary like CMUDICT, which has been refined for almost half a century now. As for punctuation, we generally disregard it. I would be happy to share the transcription analysis scripts I’ve been using with Amazon Mechanical Turk.
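To illustrate what I mean, here is a minimal sketch of my own (not Common Voice or Mechanical Turk code) showing how typed transcripts could be screened with a self-contained Soundex routine and NLTK’s copy of CMUDICT, to separate plausible spelling slips from genuine homophones:

```python
# Minimal sketch (my own illustration, not Common Voice code): screen typed
# transcripts with Soundex for plausible misspellings and with CMUDICT (via
# NLTK) for homophones of the prompt words.
import re
from nltk.corpus import cmudict  # requires a one-time nltk.download('cmudict')

PRON = cmudict.dict()  # word -> list of ARPAbet pronunciations

def soundex(word: str) -> str:
    """Classic four-character American Soundex code."""
    word = re.sub(r"[^A-Za-z]", "", word).upper()
    if not word:
        return ""
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    out, prev = word[0], codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "HW":        # H and W do not separate duplicate codes
            prev = code
    return (out + "000")[:4]

def plausible_misspelling(transcribed: str, prompt_word: str) -> bool:
    """Same Soundex code: likely a spelling slip rather than a hearing error."""
    return soundex(transcribed) == soundex(prompt_word)

def homophones(a: str, b: str) -> bool:
    """True if the two words share any CMUDICT pronunciation."""
    return any(p in PRON.get(b.lower(), []) for p in PRON.get(a.lower(), []))

# "wegas" is not a spelling slip for "vegas" (W vs. V changes the Soundex code),
# so such a transcript is evidence of a /v/-/w/ intelligibility problem.
print(plausible_misspelling("wegas", "vegas"))   # False
print(homophones("their", "there"))              # True
```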
I’m also glad to contribute my 84,000 transcripts from Turkers: about four transcripts each for thirty different marginal (often subpar, all near-subpar) recordings of each of 700 of the most commonly spoken English words and phrases, as attempted by K-6 ESL learners in the Chinese state schools. You are most welcome to measure that data in terms of the useful information quantity compared to one bit per utterance, relative to the volunteer crowdworker effort of a thumbs-up/down versus typing a transcript.
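To put “orders of magnitude more information” in rough numbers, here is a back-of-the-envelope sketch; the per-word entropy, prompt length, and transcriber count are assumed round figures for illustration, not measurements from that data set:

```python
# Back-of-the-envelope sketch with assumed round figures (not measurements
# from the Turker data): information in typed transcripts vs. one up/down bit.
ENTROPY_PER_WORD_BITS = 10.0   # rough per-word entropy of English text (assumed)
WORDS_PER_UTTERANCE = 5        # a short prompt, e.g. a common phrase (assumed)
TRANSCRIBERS_PER_CLIP = 4      # as with the Mechanical Turk transcripts

transcript_bits = ENTROPY_PER_WORD_BITS * WORDS_PER_UTTERANCE * TRANSCRIBERS_PER_CLIP
thumb_bits = 1.0               # a single binary validation vote

print(f"typed transcripts ≈ {transcript_bits:.0f} bits per clip")
print(f"thumbs-up/down    = {thumb_bits:.0f} bit per clip")
print(f"ratio             ≈ {transcript_bits / thumb_bits:.0f}x")
```

Even with much more conservative assumptions, the transcripts stay one to two orders of magnitude ahead of a single validation bit per clip.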
I would also be happy to start with English alone, which is far and away the greatest in demand for second language learning, as an experimental starting point. Please let me know your thoughts.
Do there exist “soundex” and/or 9702003 implementations that are MPL compatible for all the 104 languages of Common Voice? (Also, “soundex” alone can’t correct spelling. There would need to be a one-to-many “soundex” “inverse” and a language model to select the proper element of the inverse image.)
Do there exist MPL compatible phonetic dictionaries for all the 104 languages of Common Voice? (Also, a homophone malapropism could not be fixed by a phonetic dictionary, but it could be fixed by a language model.)
Punctuation is disregarded? Why? Current TTS uses of Common Voice employ punctuation to induce the proper pauses for commas and periods and also intonation changes for questions.
I agree that the binary up-down system discards data that would otherwise be usable with transcript corrections, but my main concern would be scalability. The goal is 10,000 hours of data, which means literally millions of clips. If a clip ordinarily takes, say, 4 seconds to review and transcription takes 12 seconds, the 10,000-hour goal will take three times longer to achieve.
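Roughly, with illustrative numbers (the average clip length and per-clip review times here are my assumptions):

```python
# Rough scaling sketch with assumed numbers: total volunteer review effort
# for the 10,000-hour goal under the two review modes.
GOAL_HOURS = 10_000
AVG_CLIP_SECONDS = 4.5                                           # assumed average clip length
SECONDS_PER_REVIEW = {"thumbs-up/down": 4, "transcription": 12}  # assumed review times

clips = GOAL_HOURS * 3600 / AVG_CLIP_SECONDS                     # ~8 million clips
for mode, sec in SECONDS_PER_REVIEW.items():
    print(f"{mode:>15}: {clips * sec / 3600:,.0f} reviewer-hours")
# With these assumptions, transcription needs about three times the
# reviewer-hours of thumbs-up/down validation.
```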
Kelly, I’ve looked over the IPA in Wiktionary. It does not have complete coverage for the less prevalent languages, but it can be used for the words where it does exist.
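For instance, here is a minimal sketch of checking whether an English Wiktionary entry carries IPA, through the standard MediaWiki API; the {{IPA|...}} template regex is an assumption about the current markup and would miss some page layouts:

```python
# Minimal sketch: look for {{IPA|<lang>|...}} templates in a Wiktionary entry
# via the standard MediaWiki API. The template regex is an assumption about
# current markup and will not catch every layout Wiktionary uses.
import re
import requests

API = "https://en.wiktionary.org/w/api.php"

def wiktionary_ipa(word: str, lang_code: str = "en") -> list[str]:
    params = {"action": "parse", "page": word, "prop": "wikitext",
              "format": "json", "formatversion": 2}
    data = requests.get(API, params=params, timeout=10).json()
    if "error" in data:          # page does not exist
        return []
    wikitext = data["parse"]["wikitext"]
    pattern = r"\{\{IPA\|" + re.escape(lang_code) + r"\|([^}]+)\}\}"
    # Keep only the first template argument (the transcription itself).
    return [m.split("|")[0] for m in re.findall(pattern, wikitext)]

print(wiktionary_ipa("water"))   # non-empty if the entry carries IPA markup
```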
Should I propose this as an English only experiment to measure the resulting data collection rate and utility relative to the single bit approach?
I also figured out that it helps to close my eyes while listening to the recording and only then read the sentence, to check whether it’s exactly what I heard.
Maybe we could fade in the text after the recording has finished? (minus the silence at the end)
Could you please include actual citations of Nakagawa and Bernstein?
Bernstein et al. (1989) “Automatic evaluation and training in English pronunciation,” in the First International Conference on Spoken Language Processing (ICSLP 1990, Kobe, Japan). The fundamental thirty-year-old error occurs in this part of section 2.2.2, “Intelligibility,” at the top left of the third page:
Listeners differ considerably in their ability to predict unintelligible words, and this contributes to the imprecision of the results. Thus, it seems that the quality rating is a more desirable number to correlate with the automatic-grading score for sentences.
To show how pervasive the mistake is, pages 8-10 of this thesis from 2016 survey all the major ways people have avoided trying to predict authentic intelligibility: almost all use the posterior probability of the HMM recognition results (Bernstein et al.’s “automatic-grading score”), and none of them try to predict authentic intelligibility from training data measuring actual listeners’ comprehension of spoken utterances. This is why a YouTube search on “Rosetta Stone fail” today will show you dozens of people so upset with the pronunciation assessment performance of the largest commercial software offerings that they feel they have to publish a video demonstrating the problem. And it is why the Educational Testing Service said last year that their latest SpeechRater 5.0 was only 58% accurate. It is astonishing to me that we have come to the point where people with relatively heavy accents have been able to use full-fledged dictation on mobile devices for years, with word error rates well under 5%, but ETS can’t score pronunciation much better than a coin flip!
There are two Nakagawa references with the intelligibility method from 2011, but this newer paper, which cites one of them, is much easier to read and understand: Kibishi et al. (2015) “A statistical method of evaluating the pronunciation proficiency/intelligibility of English presentations by Japanese speakers,” ReCALL, 27(1), 58-83. The technique of actually trying to predict authentic intelligibility from listeners’ transcripts has so far been implemented in only one commercial system, which is free as in beer but not entirely open source (the client-server code is, but the mobile apps distributed by 17zuoye.com are not) and is used by 33 million K-6 ESL students in every province in China.
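To make the contrast concrete, here is a minimal sketch of my own (not the actual method from the Nakagawa or Kibishi papers): score an utterance’s intelligibility as the fraction of prompt words that listeners recover in their transcripts, rather than as a recognizer’s posterior probability:

```python
# Minimal sketch (my illustration, not the Nakagawa/Kibishi method): measure
# intelligibility as the share of prompt words recovered in listeners'
# transcripts instead of using a recognizer's posterior score.
from difflib import SequenceMatcher

def recovered_fraction(prompt: str, transcript: str) -> float:
    """Fraction of prompt words matched, in order, in one listener's transcript."""
    p, t = prompt.lower().split(), transcript.lower().split()
    matched = sum(block.size for block in
                  SequenceMatcher(None, p, t).get_matching_blocks())
    return matched / len(p) if p else 0.0

def intelligibility(prompt: str, transcripts: list[str]) -> float:
    """Mean word-recovery rate over several listeners' transcripts."""
    return sum(recovered_fraction(prompt, t) for t in transcripts) / len(transcripts)

prompt = "las vegas is in nevada"
listener_transcripts = [
    "las wegas is in nevada",   # /v/-/w/ substitution, but still understood
    "las vegas is in nevada",
    "last vegas in nevada",
]
print(round(intelligibility(prompt, listener_transcripts), 2))   # 0.8 here
```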
But don’t take my word for it. Look at the complaints about the problem in the speech-language pathology and second language teaching literature. E.g., O’Brien et al. (2019) “Directions for the future of technology in pronunciation research and teaching,” Journal of Second Language Pronunciation, 4(2):182-207, which for example on page 186 says “pronunciation researchers are primarily interested in improving L2 learners’ intelligibility and comprehensibility, but they have not yet collected sufficient amounts of representative and reliable data (speech recordings with corresponding annotations and judgments) indicating which errors affect these speech dimensions and which do not.” Their discussion starting on page 192, “Collecting data through crowdsourcing,” is also highly pertinent to Common Voice’s approach.
One last point on why intelligibility is superior to judging learners by their accent or prosodic features such as stress. The Common European Framework of Reference for language competency uses these levels for phonological control, clearly showing that intelligibility is an earlier requirement (B1) than accent and intonation (B2) or stress (C1):
I am eager to address whatever other questions there may be. In turn, I’d like to ask the community for suggestions as to who might want to support this effort independently, should it be deemed inappropriate for Mozilla.
Somehow this reminds me of Chomsky’s funny comments on the Turing Test. Bernstein’s claim appears sensible: your ability to sing and whether the audience can figure out what song you’re singing may be two different (though overlapping) areas of investigation.
@kdavis I should have cited pp. 7-9, not 8-10, of Kyriakopoulos’s 2016 dissertation in terms of where the error recurs. Anyway, if you agree this is a nice-to-have, then please follow https://phabricator.wikimedia.org/T166929 for updates.
@kdavis @nukeador I am hoping the Wikimedia Foundation will pay to place the 0.6 GB database mentioned at https://phabricator.wikimedia.org/T166929#5473028 into CC-BY-SA.
@kdavis Have I addressed all of your outstanding concerns? @nukeador if I have, I would like to schedule a talk to prepare a management proposal.
@jsalsman this is not my area of expertise, so I’ll defer to Kelly and his team to comment on this one.
For context, I’m also putting my email reply to James here about this proposal:
First of all, thank you so much for your interest and time to improve Common Voice, we really appreciate it.
I’ve had a conversation with the Deep Speech engineers and they have evaluated your proposal. After analysis, and taking into consideration the roadmap for 2020 and our priorities, a decision has been made not to pursue your proposal at this point.
This is definitely something interesting, and we’ll keep it in mind in case we want to consider it in the future.
Thanks for your understanding; as I commented, we really appreciate feedback and input from everyone.
Cheers.
@Nathan @asa @reuben would you like to replace Microsoft with DeepSpeech in the Stanford Almond OVAL? https://community.almond.stanford.edu/t/pulseaudio-event-sequence/153/2
I’m not sure whether @nukeador is referring to my proposal for collecting transcripts at Common Voice or to replacing Microsoft speech recognition with DeepSpeech. I am happy to do as much or as little of either from afar if the Foundation is not interested, but I am sure both are strongly in your interest, and I would like to appeal any decision to leave either of them entirely to me, this year and next.
Accordingly, I am now asking that you devote at least 500 FTE hours to integrating DeepSpeech 0.6 with the Stanford Almond OVAL and to collecting transcripts instead of thumbs-up/down at Common Voice. I am sure both are excellent investments of your time. Please let me know your thoughts.