Introduction / Mozilla Creative Media Awards / Dark Matters / Intro to CommonVoice Dataset

Hello and apologies for the extremely long and potentially unusual topic name.

My name is Johann Diedrick. I am an artist, engineer, and musician who recently received a Mozilla Creative Media Award to create an interactive web experience called Dark Matters, which will spotlight the absence of Black speech in datasets used to train consumer voice technology. You can read more about the project here: https://foundation.mozilla.org/en/blog/announcing-8-projects-examining-ais-relationship-with-racial-justice/

Over the next few months I will be diving into the Common Voice English corpus to learn more about its composition. Is there a resource that goes deeper into how each of the .tsv files was constructed? I understand the intended uses of some of them (training, validation, test, etc.), but I’d love to learn more about the other .tsv files (reported, invalidated, dev, and other) and how the schema was constructed.

Thank you for any pointers on where to start digging in, and I’m looking forward to contributing to and learning from this community!

Best,
Johann


Dear Johann, you can look at:


The TSV schema should be fairly clear from the column headings:

`client_id`, `path`, `sentence`, `up_votes`, `down_votes`, `age`, `gender`, `accent`, `locale`, `segment`

reported = the clip was reported as ungrammatical, offensive, etc.
validated = the clip received 2 up-votes
invalidated = the clip received 2 down-votes (I think)
dev = development data, used for choosing the best model
test = test data, used only for final evaluation of the model (not for training and not for picking the best model)
train = training data
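If it helps to see the schema in practice, here is a minimal sketch of parsing a row with that header using Python’s standard library. The sample row below is hypothetical (the IDs, filename, and values are made up for illustration), but the column names match the header above:

```python
import csv
import io

# Hypothetical one-row TSV mirroring the Common Voice header shown above.
# In practice you would open e.g. validated.tsv instead of this string.
sample = (
    "client_id\tpath\tsentence\tup_votes\tdown_votes\t"
    "age\tgender\taccent\tlocale\tsegment\n"
    "abc123\tcommon_voice_en_1.mp3\tHello world.\t2\t0\t"
    "twenties\tfemale\tus\ten\t\n"
)

# DictReader keys each row by the column names from the header line.
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))

print(rows[0]["sentence"])   # the prompt text read aloud in the clip
print(rows[0]["up_votes"])   # vote counts are stored as strings; cast as needed
```

For a large corpus you may prefer `pandas.read_csv(path, sep="\t")`, which gives you the same columns as a DataFrame.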
Feel free to join us on Mozilla’s Matrix 🙂