Is it possible to get a file containing only the metadata of the speakers included in the Common Voice data? I would like to use the age and gender distributions in a project on the inclusion of older persons' data in ML datasets.
Would you like aggregated metadata? And which language(s) are you interested in?
Here is one command you could use:
itzpapalotl:~/CV/cv-corpus-6.1-2020-12-11$ for i in *; do cat $i/train.tsv | grep -v '^client_id' | cut -f6-8 | sort -f | uniq -c | sort -gr | sed 's/^ *//g' | sed "s/^/$i\t/g" ; done | grep -v '[ ][0-9][ ]'
The result would be something like this.
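Note that the command above counts clips rather than speakers, since each row of train.tsv is one recording. If you want per-speaker distributions, a rough sketch could look like the one below. It assumes that column 1 is client_id, column 6 is age and column 7 is gender (the same column layout the cut -f6-8 above relies on), and the function name speaker_demographics is just something I made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: count distinct speakers per (age, gender) pair in one Common Voice TSV.
# Assumptions (not verified against every release): column 1 = client_id,
# column 6 = age, column 7 = gender, and the first row is a header.
speaker_demographics() {
  local tsv="$1"
  awk -F'\t' '
    NR == 1 { next }                       # skip the header row
    !seen[$1]++ {                          # only the first clip of each client_id
      age    = ($6 == "" ? "unknown" : $6) # empty field: speaker gave no age
      gender = ($7 == "" ? "unknown" : $7) # empty field: speaker gave no gender
      count[age "\t" gender]++
    }
    END {
      for (key in count) printf "%d\t%s\n", count[key], key
    }
  ' "$tsv" | sort -t $'\t' -k1,1nr -k2     # largest groups first, ties by label
}
```

You would call it as speaker_demographics de/train.tsv, or inside the same for loop as above to cover every language. If a speaker's clips carry different age or gender values, only the first one seen is counted.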
If you’d like more help in real time, feel free to join us on Matrix.