I've created a fully annotated version of Common Voice 7.0

Fredrik_Lastow · March 28, 2022, 8:09pm

Hello!

My name is Fredrik, and I’ve used Common Voice for my master’s thesis, “Language agnostic voice classification for conversational applications”, where we classify the age and gender of the speakers in the Common Voice 7.0 corpus. In doing so, we filtered out the voice clips without metadata, set a 15 s limit on clip length, and kept a maximum of five clips per speaker. This reduced the corpus to 74 languages, 43,255 unique speakers, 318 hours, and 221,211 clips of recorded voice data. It’s a version of Common Voice that can be used more easily for speech processing tasks other than ASR.
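
For anyone who wants to try a similar subset themselves, the three filtering steps above could be sketched roughly as follows. This is only an illustration, not the thesis code: it assumes "metadata" means both the `age` and `gender` fields, that each row carries a clip length in a hypothetical `duration_s` field (the Common Voice TSV files do not necessarily include one), and that the first five qualifying clips per `client_id` are the ones kept.

```python
from collections import defaultdict

MAX_CLIP_SECONDS = 15.0     # clip length limit mentioned in the post
MAX_CLIPS_PER_SPEAKER = 5   # per-speaker cap mentioned in the post


def filter_clips(rows, max_seconds=MAX_CLIP_SECONDS,
                 max_per_speaker=MAX_CLIPS_PER_SPEAKER):
    """Keep labeled, short-enough clips, at most max_per_speaker per speaker.

    rows is an iterable of dicts with the keys "client_id", "age",
    "gender" and "duration_s" ("duration_s" is a hypothetical field).
    """
    kept = []
    clips_per_speaker = defaultdict(int)
    for row in rows:
        # Assumption: a clip "has metadata" only if age and gender are both set.
        if not row.get("age") or not row.get("gender"):
            continue
        # Assumption: clips longer than the limit are dropped, not truncated.
        if float(row["duration_s"]) > max_seconds:
            continue
        speaker = row["client_id"]
        # Assumption: keep the first clips encountered for each speaker.
        if clips_per_speaker[speaker] >= max_per_speaker:
            continue
        clips_per_speaker[speaker] += 1
        kept.append(row)
    return kept


# Tiny made-up example: seven usable clips from one speaker, one clip
# with no age label, and one clip that is too long.
example_rows = (
    [{"client_id": "a", "age": "twenties", "gender": "female", "duration_s": "4.2"}] * 7
    + [{"client_id": "b", "age": "", "gender": "male", "duration_s": "3.0"}]
    + [{"client_id": "c", "age": "fifties", "gender": "male", "duration_s": "16.5"}]
)
subset = filter_clips(example_rows)
```

In practice the rows would come from something like `csv.DictReader(f, delimiter="\t")` over a release TSV file, joined with clip durations measured from the audio.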

I’m happy to share my research, extensive data exploration and, of course, the data if anyone is interested in using it. It’s nothing crazy, but might save researchers some valuable time.

Hope you are having a good day!

Kind regards,
Fredrik Lastow

heyhillary · March 29, 2022, 10:21am

Hey Fredrik,

Welcome to the Common Voice Community Discourse and thanks for sharing how you are using the dataset.

I would be interested in learning more. Do you have a repo or any papers on your work?

Many thanks,

Hillary

Fredrik_Lastow · March 29, 2022, 11:24am

Hello Hillary,

Yes, I’d be happy to! Unfortunately, the code has not been made public yet, but there will be a repo in the future. In the meantime, I can share my thesis, where the dataset is explained in more detail. Is there somewhere I can reach you, an email perhaps?

Cheers,
Fredrik

heyhillary · March 29, 2022, 11:47am

Hey Fredrik,

No worries, thanks for the clarification.

My email is hillary@mozillafoundation.org.

I look forward to reading your thesis.
