Metadata File Only

Sara_Jordan · June 21, 2021, 5:36pm

Is it possible to get a file containing only the speaker metadata from the Common Voice data? I would like to use the distributions of age and gender in a project related to the inclusion of older persons' data in ML datasets.

ftyers · June 21, 2021, 10:26pm

Would you like aggregated metadata? And which language(s) are you interested in?

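Here is one command you could use, run from the root of the extracted corpus. It counts each combination of columns 6-8 of train.tsv (age, gender, accent) per language and drops combinations with single-digit counts: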

itzpapalotl:~/CV/cv-corpus-6.1-2020-12-11$ for i in *; do cat "$i/train.tsv" | grep -v '^client_id' | cut -f6-8 | sort -f | uniq -c | sort -gr | sed 's/^ *//g' | sed "s/^/$i\t/g" ; done | grep -v '[[:space:]][0-9][[:space:]]'

The result would be something like this.
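
Note that the rows of train.tsv are clips, not speakers, so the counts above are clip counts. If you need a per-speaker distribution, something along these lines might work. It is only a sketch: the function name `cv_age_gender_counts` is made up for illustration, and it assumes each language directory has a train.tsv whose header contains columns named `client_id`, `age` and `gender` (as in the 6.1 release). Columns are looked up by name rather than position, and empty values are reported as "unknown".

```shell
# Sketch: per language, print  lang  age  gender  clips  speakers
# Assumption: train.tsv is tab-separated with a header row naming
# client_id, age and gender; files without those columns are skipped.
cv_age_gender_counts() {
  corpus_dir="$1"
  for tsv in "$corpus_dir"/*/train.tsv; do
    [ -f "$tsv" ] || continue
    lang=$(basename "$(dirname "$tsv")")
    awk -F '\t' -v lang="$lang" '
      NR == 1 {
        # Map header names to column positions.
        for (i = 1; i <= NF; i++) col[$i] = i
        if (!("client_id" in col) || !("age" in col) || !("gender" in col)) exit
        next
      }
      {
        age = $(col["age"]);       if (age == "") age = "unknown"
        gender = $(col["gender"]); if (gender == "") gender = "unknown"
        key = age "\t" gender
        clips[key]++
        # Count each client_id once per (age, gender) combination.
        id = $(col["client_id"])
        if (!((key, id) in seen)) { seen[key, id] = 1; speakers[key]++ }
      }
      END {
        for (k in clips) printf "%s\t%s\t%d\t%d\n", lang, k, clips[k], speakers[k]
      }
    ' "$tsv"
  done | LC_ALL=C sort
}
```

You would call it as `cv_age_gender_counts ~/CV/cv-corpus-6.1-2020-12-11 > age_gender.tsv`. Keep in mind that `client_id` identifies a contributor only as far as the dataset does, so treat the speaker counts as approximate.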

If you’d like more help in real time, feel free to join us on Matrix.
