Introduction / Mozilla Creative Media Awards / Dark Matters / Intro to CommonVoice Dataset

Hello and apologies for the extremely long and potentially unusual topic name.

My name is Johann Diedrick. I am an artist, engineer, and musician who recently received a Mozilla Creative Media Award to create an interactive web experience called Dark Matters, which will spotlight the absence of Black speech in datasets used to train consumer voice technology. You can read more about the project here:

Over the next few months I will be diving into the CommonVoice English corpus to learn more about its composition. Is there a resource that goes deeper into how each of the .tsv files was constructed? I understand the intended uses of some of them (training, validation, test, etc.), but I’d love to learn more about the other .tsv files (reported, invalidated, dev, and other) and how the schema was constructed.

Thank you for any pointers on where to start digging in, and I’m looking forward to contributing to and learning from this community!



Dear Johann. You can look at:

The TSV schema should be fairly clear from the headings:
client_id	path	sentence	up_votes	down_votes	age	gender	accent	locale	segment

reported = clip was reported as ungrammatical etc.
validated = clip received 2 upvotes
invalidated = clip received 2 downvotes (I think)
dev = development data, used for choosing the best model
test = test data (used for testing the model, not used in training and not used to pick the best model)
train = training data
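
If it helps to start digging, here is a minimal sketch (not an official tool) that counts the rows in each file and tallies the values of one column, such as accent. It assumes all the .tsv files sit in one folder and are tab-separated with a header line like the one above; the names `SPLITS`, `summarize`, and `summarize_corpus` are made up for illustration, and a file that lacks the chosen column is simply counted as blank.

```python
import csv
from collections import Counter
from pathlib import Path

# File names taken from this thread; the folder layout is an assumption.
SPLITS = ["train", "dev", "test", "validated", "invalidated", "reported", "other"]


def read_rows(tsv_file):
    """Yield one dict per row from an open TSV file, keyed by the header line."""
    # QUOTE_NONE keeps quote characters inside sentences as literal text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    yield from reader


def summarize(tsv_file, column="accent"):
    """Return (row_count, Counter of values in `column`).

    Empty or missing values are counted under the label '(blank)'.
    """
    counts = Counter()
    total = 0
    for row in read_rows(tsv_file):
        total += 1
        value = (row.get(column) or "").strip()
        counts[value or "(blank)"] += 1
    return total, counts


def summarize_corpus(corpus_dir, column="accent"):
    """Summarize every split file that exists under corpus_dir (hypothetical path)."""
    results = {}
    for split in SPLITS:
        path = Path(corpus_dir) / f"{split}.tsv"
        if path.exists():
            with path.open(newline="", encoding="utf-8") as f:
                results[split] = summarize(f, column)
    return results
```

Calling `summarize_corpus("path/to/en")` (placeholder path) returns a dict keyed by file name. Passing `column="up_votes"` or `column="down_votes"` on validated and invalidated is one way to check the vote thresholds described above against the actual data.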
Feel free to join us on Mozilla’s Matrix 🙂