Common Voice and accent choice: new paper about accents in Common Voice

kathyreid · October 31, 2023, 2:12pm

Hi everyone,

I recently published a paper on accents in Common Voice. It looks at how contributors describe their own accents in the Common Voice English dataset. The code is openly available and linked in the paper.

https://dl.acm.org/doi/10.1145/3617694.3623258

@inproceedings{10.1145/3617694.3623258,
author = {Reid, Kathy and Williams, Elizabeth T.},
title = {Common Voice and Accent Choice: Data Contributors Self-Describe Their Spoken Accents in Diverse Ways},
year = {2023},
isbn = {9798400703812},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3617694.3623258},
doi = {10.1145/3617694.3623258},
abstract = {The use of machine learning (ML)-powered speech technologies has increased significantly in recent years [40, 56, 72]. The datasets used for training speech models often represent demographic features of the speaker – such as gender, age, and accent. These axes are frequently used to evaluate the training set and model for bias [52]. Here, we focus on how accent is represented in voice data due to the adverse consequences of accent bias. We perform document analysis on several voice datasets to identify how accents are currently represented. We then analyse and visualise speaker-described accents from Mozilla’s Common Voice (CV) v13 English dataset, forming an emergent taxonomy of accent descriptors. We repeat this process using the CV v13 Kiswahili dataset, demonstrating that the taxonomy has use beyond English. We find that accents are currently represented in ways that are geographically, and predominantly, nationally bound. While this pattern is also shown in speaker-described accents from CV, a more diverse set of descriptors is revealed. This work provides some early evidence for re-thinking how accents are represented in datasets intended for ML applications. Our tooling is open-sourced, and we invite further work that uses our taxonomy to assess accent bias in speech data and models.},
booktitle = {Proceedings of the 3rd ACM Conference on Equity and Access in Algorithms, Mechanisms, and Optimization},
articleno = {35},
numpages = {10},
keywords = {accent data, dataset documentation, accent recognition, datasets, bias corpora, data visualization, metadata, speech data, voice data, bias, accent bias},
location = {Boston, MA, USA},
series = {EAAMO '23}
}

gina · November 2, 2023, 7:19am

Thanks Kathy, this is going to be an interesting read.
