I've created a fully annotated version of Common Voice 7.0

Hello!

My name is Fredrik and I’ve used Common Voice for my master’s thesis, “Language agnostic voice classification for conversational applications”, where we classify the age and gender of the speakers in the Common Voice 7.0 corpus. In doing so, we filtered out the voice clips without metadata, set a 15 s clip limit and kept a maximum of five clips per speaker. This reduced the corpus to 74 languages, 43,255 unique speakers, 318 hours and 221,211 clips of recorded voice data. The result is a version of Common Voice that is easier to use for speech processing tasks other than ASR.
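For anyone who wants a feel for the filtering, here is a rough Python sketch of the three steps described above. It is not the thesis code (that is not public yet), just an illustration under some assumptions: `client_id`, `age` and `gender` follow the column names in the Common Voice TSV files, while `duration_s` is a hypothetical field you would have to compute from the audio yourself. I also assume that "without metadata" means a missing age or gender label, that clips longer than 15 s are dropped rather than trimmed, and that the first five clips per speaker are the ones kept.

```python
from collections import defaultdict

MAX_SECONDS = 15.0          # clip length limit from the post
MAX_CLIPS_PER_SPEAKER = 5   # per-speaker cap from the post


def filter_clips(rows, max_seconds=MAX_SECONDS,
                 max_per_speaker=MAX_CLIPS_PER_SPEAKER):
    """Return the rows that pass all three filters, in input order.

    Each row is a dict with "client_id", "age", "gender" and the
    hypothetical "duration_s" key (clip length in seconds).
    """
    kept = []
    clips_per_speaker = defaultdict(int)
    for row in rows:
        # 1. Drop clips without age or gender labels (assumed meaning
        #    of "without metadata").
        if not row.get("age") or not row.get("gender"):
            continue
        # 2. Drop clips longer than the limit (assumption: dropped,
        #    not trimmed).
        if row["duration_s"] > max_seconds:
            continue
        # 3. Keep at most max_per_speaker clips per speaker
        #    (assumption: the first ones encountered).
        speaker = row["client_id"]
        if clips_per_speaker[speaker] >= max_per_speaker:
            continue
        clips_per_speaker[speaker] += 1
        kept.append(row)
    return kept


if __name__ == "__main__":
    demo = [
        {"client_id": "a", "age": "twenties", "gender": "male", "duration_s": 4.0},
        {"client_id": "b", "age": "", "gender": "female", "duration_s": 3.0},
        {"client_id": "c", "age": "teens", "gender": "female", "duration_s": 20.0},
    ]
    print(len(filter_clips(demo)), "clip(s) kept")
```

Keeping the first five clips per speaker is the simplest choice; sampling five at random would work just as well with the same structure.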

I’m happy to share my research, extensive data exploration and, of course, the data if anyone is interested in using it. It’s nothing crazy, but might save researchers some valuable time.

Hope you are having a good day!

Kind regards,
Fredrik Lastow

Hey Fredrik,

Welcome to the Common Voice Community Discourse and thanks for sharing how you are using the dataset.

I would be interested in learning more about it. Do you have a repo or papers regarding your work?

Many thanks,

Hillary

Hello Hillary,

Yes, I’d be happy to! Unfortunately the code has not been made public yet, but there will be a repo in the future. However, I can share my thesis with you, where the dataset is explained in more detail. Is there somewhere I can reach you, an email perhaps?

Cheers,
Fredrik

Hey Fredrik,

No worries, thanks for the clarification.

My email is hillary@mozillafoundation.org.

I look forward to reading your thesis.