Bias against accented speech from voting instead of transcribing

On a suggestion by @solana at the Stanford Human-centered AI seminar this morning, I filed this pull request with the Common Voice dataset card:

Discussion of Biases

The dataset is validated by “thumbs up” or “thumbs down” voting by listeners, as opposed to typed transcriptions. This biases speech recognition and pronunciation assessment systems against accented speakers (see, e.g., Kibishi and Nakagawa 2011, Loukina et al. 2015, and Gao et al. 2018). Such biases prevent accurate speech-to-text and pronunciation scoring for accented speakers, including in high-stakes assessments such as for immigration qualification (e.g., Australian Associated Press 2017, Ferrier 2017, Main and Watson 2022), forcing pronunciation assessment manufacturers to overhaul their offerings with transcription data capable of measuring genuine listener intelligibility. O’Brien et al. discuss this issue in “Directions for the future of technology in pronunciation research and teaching,” Journal of Second Language Pronunciation 4 (2):182-207, e.g. on page 186, stating, “pronunciation researchers are primarily interested in improving L2 learners’ intelligibility and comprehensibility, but they have not yet collected sufficient amounts of representative and reliable data (speech recordings with corresponding annotations and judgments) indicating which errors affect these speech dimensions and which do not.” [Emphasis added; see also their discussion starting on page 192, “Collecting data through crowdsourcing.”] Mozilla’s EM Lewis-Jong discussed the trade-off of the greater quantity of data collection using binary voting at the expense of the greater quality of transcriptions typed by listeners during the Q&A portion of this NVIDIA Speech AI Summit session.

Please see also my attempt to address this here four years ago.

@Em.Lewis-Jong, isn’t there enough quantity now that you can afford to replace voting with transcribing? Please consider the effects described in in particular. Thank you for your consideration of this issue.

-Jim Salsman
Mountain View

@gregor and @Em.Lewis-Jong, now that there is such a large volume of voted utterances, and nothing promised regarding accents ever addressed the underlying issue described above, can the transition be eased by allowing the vote table to hold both booleans and text strings? Perhaps by adding another column for at please?
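To make the idea concrete, here is a minimal sketch of a vote table that accepts either a boolean vote or a typed transcript. It uses SQLite through Python purely for illustration; the table name, the column names (`is_valid`, `transcript`), and the helper functions are hypothetical and are not taken from the actual Common Voice schema.

```python
import sqlite3


def make_votes_table(conn):
    """Create a hypothetical votes table holding a boolean vote, a transcript, or both."""
    conn.execute(
        """
        CREATE TABLE votes (
            id         INTEGER PRIMARY KEY,
            clip_id    INTEGER NOT NULL,
            client_id  TEXT NOT NULL,
            -- Existing style of vote: 1 = thumbs up, 0 = thumbs down, NULL = none given.
            is_valid   INTEGER CHECK (is_valid IN (0, 1)),
            -- New style of vote: what the listener typed after hearing the clip.
            transcript TEXT,
            -- Every row must carry at least one of the two kinds of judgement.
            CHECK (is_valid IS NOT NULL OR transcript IS NOT NULL)
        )
        """
    )


def add_vote(conn, clip_id, client_id, is_valid=None, transcript=None):
    """Insert one listener judgement; booleans are stored as 0 or 1."""
    conn.execute(
        "INSERT INTO votes (clip_id, client_id, is_valid, transcript) VALUES (?, ?, ?, ?)",
        (clip_id, client_id, None if is_valid is None else int(is_valid), transcript),
    )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    make_votes_table(conn)
    add_vote(conn, 1, "listener-a", is_valid=True)
    add_vote(conn, 1, "listener-b", transcript="the sheep sailed away")
    print(conn.execute("SELECT clip_id, is_valid, transcript FROM votes").fetchall())
```

With a nullable column like this, existing boolean votes would stay as they are and transcripts could accumulate alongside them, so no historical data would need to be migrated.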


Hi Jim.

Happy to speak more about this, and we’re always trying to evolve our approach!

Firstly - the community voting system is not about accent - it is specifically there to ensure audio recording quality, and word-to-utterance accuracy. Non-natively accented speech is actively and profoundly welcomed at Common Voice, and this is made clear throughout the contribution guidelines.

We have spent the last two years running various initiatives and campaigns aimed at improving speaker diversity in datasets, with a great deal of success. We have also been doing more to help educate new communities in on-boarding about the benefits of speaker diversity. If people reject a clip on the basis of accent, then they have made a mistake, and it is counter to a) the aims of the project, b) the specific guidelines we release, and c) the messaging in our mobilisation work.

We find that by having multiple rounds of user review for any given clip, individual user error like downvoting clips incorrectly can be minimised - and we still release all downvoted clips so that this data isn’t lost for usage.

We have already evolved our approach to accent metadata to make it even easier for people to express the richness of their accent. Accent metadata is freeform specifically to enable us to capture the diversity and variety of accents, including the accents of multilingual speakers.

To your point about transcription possibly being a less biased mechanism - we are in fact exploring using this approach for new data formats like spontaneous speech, likely to be released in Beta later in 2023! We will then have more concrete data on the benefits and risks of these different approaches.

If you’d like to be part of the community research group that looks at this data, we would welcome that!



I’d be happy to help, @Em.Lewis-Jong; just let me know how.

The problem with voting, beyond representing the quality of an entire utterance with a single bit, is that it’s always a subjective judgement as to whether a non-standard pronunciation is incorrect or merely an accent. Raters will apply their personal standard, not because they are ignoring your instructions, but because volunteers aren’t phonology experts, and are simply incapable of reliably discerning between accents and errors, even when a potential error would always be inconsequential. Volunteer raters simply don’t know, so even when they believe they are complying with messaging to tell them to allow accents, it’s literally impossible for them to do so. The coarse granularity of voting on the entire utterance means that aggregation doesn’t effectively alleviate this problem.

Downstream, dataset users are going to have downvotes and upvotes, but no information about the part of the utterance which was disapproved. Their speech recognition models will discriminate, as will their pronunciation assessment models.

Transcription solves both of those problems. Whether a listener’s transcript matches the intended speech is an objective measure of genuine intelligibility, and it aggregates in a way that identifies the specific obstruction to listener comprehension at the word level and often also at the phoneme level. It doesn’t require consulting a professional phonologist with an understanding of the L1 and L2 populations involved to distinguish between accent and error. It also addresses the fact that heavily accented speech can be and often is incomprehensible to most listeners.
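As a rough sketch of what aggregating at the word level could look like, the following aligns each listener's typed transcript against the prompt and reports the fraction of listeners who recovered each prompt word. The function names and the simple normalisation rule are my own assumptions for illustration; a production system would need language-specific text normalisation and probably a phoneme-level alignment as well.

```python
from collections import Counter
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and replace punctuation with spaces so only word identity is compared."""
    cleaned = "".join(ch if ch.isalnum() or ch in "' " else " " for ch in text.lower())
    return cleaned.split()


def word_intelligibility(prompt, transcripts):
    """Return (word, fraction of listeners who typed it) for each word of the prompt."""
    prompt_words = normalize(prompt)
    hits = Counter()
    for transcript in transcripts:
        heard = normalize(transcript)
        # Align the listener's words to the prompt's words.
        matcher = SequenceMatcher(a=prompt_words, b=heard, autojunk=False)
        for block in matcher.get_matching_blocks():
            for i in range(block.a, block.a + block.size):
                hits[i] += 1
    total = len(transcripts)
    return [(word, hits[i] / total if total else 0.0) for i, word in enumerate(prompt_words)]


if __name__ == "__main__":
    scores = word_intelligibility(
        "The ship sailed away.",
        ["the ship sailed away", "the sheep sailed away", "the sheep sailed away"],
    )
    for word, score in scores:
        print(f"{word}: {score:.2f}")
```

In this made-up example, only one of three listeners recovers "ship", so that word is flagged as the obstruction while the rest of the utterance scores as fully intelligible. A single up or down vote on the whole clip cannot carry that information.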

Regarding the use of geographic location to represent accent, frankly it’s completely insufficient. Accent involves not just location, but the background of the population in which a learner forms their cerebellar dexterity, including the same fractal backgrounds of their parents, teachers, peers, and, for example, the shopkeepers and officials with whom they must interact in their community, not to mention the media to which they listen. But representing an accent with a location will not cause bias or discrimination, primarily because there’s nothing model builders can use that information to do.

I’m absolutely fascinated by your spontaneous speech initiative. How can I learn more about it?

Dear Jsalsman,

Thank you for your interest in Common Voice.

It’s also worth noting that transcription introduces new problems: it requires contributors to be able to write in some normalised form as well as to be able to read.

This is not how it works. The accent submission is broadly free-form, so people are able to self-identify their accents as they wish.

We would be very interested in concrete and quantitative substantiation for your statements regarding accents in Common Voice datasets if you have it available.

If you’re looking for a better way to characterize accents than geography or self-description, voiced phone formant characterization (e.g. as in is objective, data-based, and correlates well with ethnicities.

If you’re asking me to substantiate my assertion that nobody can use the accent characterizations you provide, I can’t prove a negative, and you’re in a better position to provide example(s) of how they are being used.

How can I get involved with the spontaneous speech initiatives?

I think the system we have, based on self-description, is a good start. Of course it can be improved, and there are many possible avenues for that. But one of its main benefits is that it does not require specialist input and relies only on speakers themselves, meaning it can scale arbitrarily (e.g. to thousands of languages). In addition, it is based on speakers’ self-perception, meaning that we are not in the position to mis-identify or mis-classify people, and we do not exclude people by forcing them into pre-defined categories. What is done with that data later is a question for research. As for what is better, there are many dimensions on which a particular approach can be better, but it is unlikely that a single approach is better on all of those dimensions.

One thing you could try is random sampling from the rejected clips and see what percentage are rejected because of accent rather than other reasons. That would be an interesting study to do.
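A minimal sketch of how such a study could be set up, assuming the rejected clips are listed in a tab-separated file with a header row (the file name `invalidated.tsv` in the usage note is an assumption about the release layout, and both function names are hypothetical). The first function draws a reproducible random sample for manual review; the second turns the reviewers' counts into a proportion with a Wilson score interval, so the result comes with an honest margin of error.

```python
import csv
import math
import random


def sample_rejected(tsv_path, k, seed=0):
    """Draw up to k rows uniformly at random from a TSV of rejected clips."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps quotation marks inside sentences from being misparsed.
        rows = list(csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE))
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    return rng.sample(rows, min(k, len(rows)))


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))
```

For example, reviewers could listen to `sample_rejected("invalidated.tsv", 200)`, mark each clip as rejected for accent or for another reason, and then pass the accent count and the sample size to `wilson_interval`. Deciding whether a given rejection was due to accent is itself a subjective call, so having more than one reviewer label each sampled clip would be prudent.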

I imagine that further information will be provided here and on Matrix later in the year.
