Discussion of Biases
The dataset is validated by “thumbs up” or “thumbs down” voting by listeners, rather than by typed transcriptions. This biases speech recognition and pronunciation assessment systems against accented speakers (see, e.g., Kibishi and Nakagawa 2011; Loukina et al. 2015; and Gao et al. 2018). Such biases prevent accurate speech-to-text and pronunciation scoring for accented speakers, including in high-stakes assessments such as immigration qualification (e.g., Australian Associated Press 2017; Ferrier 2017; Main and Watson 2022), forcing vendors of pronunciation assessment systems to overhaul their offerings with transcription data capable of measuring genuine listener intelligibility.

O’Brien et al. discuss this issue in “Directions for the future of technology in pronunciation research and teaching,” Journal of Second Language Pronunciation 4(2): 182-207, e.g. on page 186: “pronunciation researchers are primarily interested in improving L2 learners’ intelligibility and comprehensibility, but they have not yet collected sufficient amounts of representative and reliable data (speech recordings with corresponding annotations and judgments) indicating which errors affect these speech dimensions and which do not.” [Emphasis added; see also their discussion starting on page 192, “Collecting data through crowdsourcing.”]

During the Q&A portion of this NVIDIA Speech AI Summit session, Mozilla’s EM Lewis-Jong discussed the trade-off between the greater quantity of data that binary voting yields and the greater quality of transcriptions typed by listeners.
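To make the contrast concrete, here is a minimal sketch of the two validation signals. The clip prompt, vote list, and listener transcriptions are hypothetical illustrations, not Common Voice’s actual schema; the intelligibility score is computed as one minus the mean word error rate between the prompt and what each listener typed.

```python
# Minimal sketch contrasting binary voting with transcription-based
# validation. All data below is hypothetical, for illustration only.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance, normalized by reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

prompt = "the quick brown fox jumps over the lazy dog"

# Binary validation: one bit per listener; no record of which words failed.
votes = [True, True, False]
validated = sum(votes) > len(votes) / 2  # clip kept, nothing else learned

# Transcription-based validation: each listener types what they heard,
# yielding a graded score plus per-word information about what was lost.
listener_transcriptions = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox jumps over a lazy dog",
]
intelligibility = 1 - sum(
    word_error_rate(prompt, t) for t in listener_transcriptions
) / len(listener_transcriptions)

print(f"validated={validated}, intelligibility={intelligibility:.2f}")
```

The point of the sketch is that the transcription path preserves which words listeners actually recovered, which is the kind of annotated judgment data O’Brien et al. describe as missing; a binary vote discards that information entirely.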
Please also see my attempt to address this here four years ago.
@Em.Lewis-Jong, isn’t the quantity of collected data now large enough that you can afford to replace voting with transcription? Please consider in particular the effects described in https://www.bbc.com/news/uk-60264106. Thank you for your consideration of this issue.