Bias against accented speech from voting instead of transcribing

On a suggestion by @Solana at the Stanford Human-centered AI seminar this morning, I filed this pull request with the Common Voice dataset card:

Discussion of Biases

The dataset is validated by “thumbs up” or “thumbs down” voting by listeners, as opposed to typed transcriptions. This biases speech recognition and pronunciation assessment systems against accented speakers (see e.g. Kibishi and Nakagawa 2011, Loukina et al. 2015, and Gao et al. 2018). Such biases prevent accurate speech-to-text and pronunciation scoring for accented speakers, including in high-stakes assessments such as for immigration qualification (e.g., Australian Associated Press 2017, Ferrier 2017, Main and Watson 2022), forcing pronunciation assessment manufacturers to overhaul their offerings with transcription data capable of measuring genuine listener intelligibility. O’Brien et al. discuss this issue in “Directions for the future of technology in pronunciation research and teaching,” Journal of Second Language Pronunciation 4 (2):182-207, e.g. on page 186, stating, “pronunciation researchers are primarily interested in improving L2 learners’ intelligibility and comprehensibility, but they have not yet collected sufficient amounts of representative and reliable data (speech recordings with corresponding annotations and judgments) indicating which errors affect these speech dimensions and which do not.” [Emphasis added; see also their discussion starting on page 192, “Collecting data through crowdsourcing.”] During the Q&A portion of this NVIDIA Speech AI Summit session, Mozilla’s EM Lewis-Jong discussed the trade-off between the greater quantity of data collected through binary voting and the greater quality of transcriptions typed by listeners.

Please see also my attempt to address this here four years ago.

@Em.Lewis-Jong, isn’t there enough quantity now that you can afford to replace voting with transcribing? Please consider the effects described in in particular. Thank you for your consideration of this issue.

-Jim Salsman
Mountain View

@gregor and @Em.Lewis-Jong, now that there is such a large volume of voted utterances, and nothing promised regarding accents ever addressed the underlying issue described above, can the transition be eased by allowing the vote table to hold both booleans and text strings? Perhaps by adding another column for at please?
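
To make the idea of one table holding both kinds of judgement concrete, here is a minimal sketch using Python's built-in `sqlite3`. The table and column names (`votes`, `clip_id`, `is_valid`, `transcript`) and the helper functions are hypothetical and are not taken from the actual Common Voice schema; the sketch only illustrates that a nullable text column can sit next to the existing boolean.

```python
import sqlite3


def make_vote_table():
    """Create an in-memory table whose rows hold a vote, a transcript, or both.

    All names here are hypothetical; this is not the Common Voice schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE votes (
            id         INTEGER PRIMARY KEY,
            clip_id    INTEGER NOT NULL,
            is_valid   INTEGER CHECK (is_valid IN (0, 1)),
            transcript TEXT,
            CHECK (is_valid IS NOT NULL OR transcript IS NOT NULL)
        )
        """
    )
    return conn


def add_vote(conn, clip_id, is_valid=None, transcript=None):
    """Insert one listener judgement; at least one of the two fields is required."""
    flag = None if is_valid is None else int(is_valid)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO votes (clip_id, is_valid, transcript) VALUES (?, ?, ?)",
            (clip_id, flag, transcript),
        )


if __name__ == "__main__":
    conn = make_vote_table()
    add_vote(conn, 1, is_valid=True)                        # existing-style vote
    add_vote(conn, 1, transcript="the ship sailed at dawn")  # typed transcript
    print(conn.execute("SELECT clip_id, is_valid, transcript FROM votes").fetchall())
```

Existing boolean rows would stay valid under such a layout, and transcript rows could be added gradually alongside them.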


Hi Jim.

Happy to speak more about this, and we’re always trying to evolve our approach!

Firstly - the community voting system is not about accent - it is specifically there to ensure audio recording quality, and word-to-utterance accuracy. Non-natively accented speech is actively and profoundly welcomed at Common Voice, and this is made clear throughout the contribution guidelines.

We have spent the last two years running various initiatives and campaigns aimed at improving speaker diversity in datasets, with a great deal of success. We have also been doing more to help educate new communities in on-boarding about the benefits of speaker diversity. If people reject a clip on the basis of accent, then they have made a mistake, and it is counter to a) the aims of the project, b) the specific guidelines we release, and c) the messaging in our mobilisation work.

We find that by having multiple rounds of user review for any given clip, individual user error like downvoting clips incorrectly can be minimised - and we still release all downvoted clips so that this data isn’t lost for usage.

We have already evolved our approach to accent metadata to make it even easier for people to express the richness of their accent. Accent metadata is freeform specifically to enable us to capture the diversity and variety of accents, including the accents of multilingual speakers.

To your point about transcription possibly being a less biased mechanism - we are in fact exploring using this approach for new data formats like spontaneous speech, likely to be released in Beta later in 2023! We will then have more concrete data on the benefits and risks of these different approaches.

If you’d like to be part of the community research group that looks at this data, we would welcome that!



I’d be happy to help, @Em.Lewis-Jong; just let me know how.

The problem with voting, beyond representing the quality of an entire utterance with a single bit, is that it’s always a subjective judgement as to whether a non-standard pronunciation is incorrect or merely an accent. Raters will apply their personal standard, not because they are ignoring your instructions, but because volunteers aren’t phonology experts, and are simply incapable of reliably discerning between accents and errors, even when a potential error would always be inconsequential. Volunteer raters simply don’t know, so even when they believe they are complying with messaging to tell them to allow accents, it’s literally impossible for them to do so. The coarse granularity of voting on the entire utterance means that aggregation doesn’t effectively alleviate this problem.

Downstream, dataset users are going to have downvotes and upvotes, but no information about the part of the utterance which was disapproved. Their speech recognition models will discriminate, as will their pronunciation assessment models.

Transcription solves both of those problems. Whether a listener’s transcript matches the intended speech is an objective measure of genuine intelligibility, and it aggregates in a way that identifies the specific obstruction to listener comprehension at the word level and often also at the phoneme level. It doesn’t require consulting a professional phonologist with an understanding of the L1 and L2 population involved to discern between accent and error. It also addresses the fact that heavily accented speech can be and often is incomprehensible to most listeners.
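
As a rough illustration of how typed transcripts aggregate at the word level, the sketch below aligns each listener's transcript with the prompt and reports, per prompt word, the fraction of listeners who did not reproduce it. The function names and the example sentence are made up for illustration, and the alignment uses the standard library's `difflib.SequenceMatcher` rather than any method Common Voice is known to use.

```python
from collections import Counter
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and drop punctuation so that only the words are compared."""
    cleaned = "".join(
        ch if ch.isalnum() or ch.isspace() or ch == "'" else " "
        for ch in text.lower()
    )
    return cleaned.split()


def word_miss_rates(prompt, transcripts):
    """For each prompt word, return the fraction of transcripts that missed it."""
    if not transcripts:
        raise ValueError("need at least one transcript")
    target = normalize(prompt)
    misses = Counter()
    for transcript in transcripts:
        heard = normalize(transcript)
        matcher = SequenceMatcher(a=target, b=heard, autojunk=False)
        matched = set()
        # Collect the prompt word positions that align with the transcript.
        for block in matcher.get_matching_blocks():
            matched.update(range(block.a, block.a + block.size))
        for i in range(len(target)):
            if i not in matched:
                misses[i] += 1
    n = len(transcripts)
    return [(word, misses[i] / n) for i, word in enumerate(target)]


if __name__ == "__main__":
    prompt = "The ship sailed at dawn."
    transcripts = [
        "the sheep sailed at dawn",
        "the ship sailed at dawn",
        "the sheep sailed a dawn",
    ]
    for word, rate in word_miss_rates(prompt, transcripts):
        print(f"{word:8s} missed by {rate:.0%} of listeners")
```

A single up or down vote on the same clip would carry none of this per-word detail.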

Regarding the use of geographic location to represent accent, frankly it’s completely insufficient. Accent involves not just location, but the background of the population in which a learner forms their cerebellar dexterity, including the same fractal backgrounds of their parents, teachers, peers, and the e.g. shopkeepers and officials with whom they must interact in their community, not to mention the media to which they listen. But representing an accent with a location will not cause bias or discrimination, primarily because there’s nothing model builders can use that information to do.

I’m absolutely fascinated by your spontaneous speech initiative. How can I learn more about it?

Dear Jsalsman,

Thank you for your interest in Common Voice.

It’s also worth noting that transcription introduces new problems: it requires speakers to be able to write in some normalised form as well as be able to read.

This is not how it works. The accent submission is broadly free-form, so people are able to self-identify their accents as they wish.

We would be very interested in concrete and quantitative substantiation for your statements regarding accents in Common Voice datasets if you have it available.

If you’re looking for a better way to characterize accents than geography or self-description, voiced phone formant characterization (e.g. as in is objective, data-based, and correlates well with ethnicities.

If you’re asking me to substantiate my assertion that nobody can use the accent characterizations you provide, I can’t prove a negative, and you’re in a better position to provide example(s) of how they are being used.

How can I get involved with the spontaneous speech initiatives?

I think the system we have based on self-description is a good start. Of course it can be improved, and there are many possible avenues for that. But one of its main benefits is that it does not require specialist input and relies only on speakers themselves, meaning it can scale arbitrarily (e.g. to thousands of languages). In addition, it is based on speakers’ self-perception, meaning that we are not in the position to mis-identify or mis-classify people, and we do not exclude people by forcing them into pre-defined categories. What is done with that data later is a question for research. In terms of what is better, there are many dimensions on which a particular approach can be better, but it is unlikely that a single approach is better on all of those dimensions.

One thing you could try is random sampling from the rejected clips and see what percentage are rejected because of accent rather than other reasons. That would be an interesting study to do.
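
A minimal sketch of that sampling study, assuming someone listens to each sampled clip and marks by hand whether accent appears to be the reason for rejection. The clip IDs, the sample size, and the count of 30 are invented placeholders, not results; the Wilson score interval is one common way to put error bars on such a proportion.

```python
import math
import random


def sample_clips(clip_ids, k, seed=0):
    """Draw up to k rejected clip IDs uniformly at random, without replacement."""
    ids = list(clip_ids)
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    return rng.sample(ids, min(k, len(ids)))


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 is roughly 95% confidence)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    rejected = [f"clip_{i}" for i in range(10_000)]  # placeholder IDs
    to_review = sample_clips(rejected, 200)
    # After manual review, suppose 30 of the 200 were judged accent-related.
    # That count is a made-up number used only to show the calculation.
    low, high = wilson_interval(30, len(to_review))
    print(f"accent-related share of rejections: between {low:.1%} and {high:.1%}")
```

The manual judgement itself is the hard part, since reviewers face the same accent-versus-error ambiguity as the original voters.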

I imagine that further information will be provided here and on Matrix later in the year.


With all respect, are there any dimensions on which characterizing voiced phone formant ratios wouldn’t be objectively better? It too doesn’t require specialist input, with the added advantage of not requiring any input other than speech from the speaker, so there is no opportunity for error in self-classification either. Sadly, people are very often unable to identify their own accent with any precision in ways other laypeople have any hope of understanding. And of course it scales much more easily, involving the automatic process of segmentation of which your own speech recognition systems are capable, followed by formant analysis e.g. UCL’s open source SFS: The formant ratios would be aggregated for each voiced phone in the utterance, and those characterizations would be k-means clustered to any degree of granularity you find useful for human-readable labeling, whether based on geographic locales, ethnicities, or preferably both.
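
Here is a small sketch of the aggregation and clustering steps, assuming segmentation and formant extraction have already produced (phone, F1, F2) measurements for each speaker. Using the F2/F1 ratio, the three-phone inventory, and the function names are all assumptions made for illustration, and the plain Lloyd's k-means below stands in for whatever clustering library one would actually use.

```python
import math
import random
from collections import defaultdict

PHONES = ("i", "a", "u")  # illustrative phone inventory, not a recommendation


def speaker_vector(measurements, phones=PHONES):
    """Average the F2/F1 ratio per phone into one fixed-length vector.

    Assumes every listed phone occurs at least once for the speaker.
    """
    ratios = defaultdict(list)
    for phone, f1, f2 in measurements:
        ratios[phone].append(f2 / f1)
    return [sum(ratios[p]) / len(ratios[p]) for p in phones]


def kmeans(vectors, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


if __name__ == "__main__":
    # Made-up formant values in Hz for four hypothetical speakers.
    speakers = {
        "s1": [("i", 300, 2400), ("a", 800, 1600), ("u", 350, 1050)],
        "s2": [("i", 310, 2500), ("a", 790, 1650), ("u", 340, 1000)],
        "s3": [("i", 400, 2000), ("a", 700, 2100), ("u", 300, 1200)],
        "s4": [("i", 410, 2100), ("a", 690, 2150), ("u", 310, 1250)],
    }
    vectors = [speaker_vector(m) for m in speakers.values()]
    labels, _ = kmeans(vectors, k=2)
    print(dict(zip(speakers, labels)))
```

The choice of `k` sets the granularity, and attaching human-readable labels to the resulting clusters would still need someone to inspect them.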

I would love to put together a proof of concept if you’re interested. By classifying accents with their objective, quantitative measures, you’d be providing valuable information that downstream users would actually be able to make use of.

@Em.Lewis-Jong Here are a couple of large-scale users of Common Voice data who could really use help from listener transcriptions:



Hi Jim and Francis.

Thanks for your patience whilst I was on leave.

Jim - information about Spontaneous Speech will be made available later in the year, and people will be invited to submit their interest in engaging. We will be experimenting with transcriptions at that stage, and this will provide useful data for benchmarking different approaches. As Francis has mentioned, transcription as an approach is not a silver bullet for the wider problem of bias within machine learning; it will simply alter the nature of it. Exploring these questions is one of the many things that is being done with Common Voice by the hundreds of thousands of engineers and research teams who use the dataset and provide regular input and feedback.

Your suggestions and perspectives on alternatives are well noted. Thanks for laying them out in detail!

Thanks also to everyone for making sure we live by our community participation guidelines.
