Hello and apologies for the extremely long and potentially unusual topic name.
My name is Johann Diedrick. I am an artist, engineer, and musician who recently received a Mozilla Creative Media Award to create an interactive web experience called Dark Matters, which will spotlight the absence of Black speech in datasets used to train consumer voice technology. You can read more about the project here: https://foundation.mozilla.org/en/blog/announcing-8-projects-examining-ais-relationship-with-racial-justice/
Over the next few months I will be diving into Common Voice’s English corpus to learn more about its composition. Is there a resource that goes deeper into how each of the .tsv files was constructed? I understand the intended uses of some of them (training, validation, test, etc.), but I’d love to learn more about the other .tsv files (reported, invalidated, dev, and other) and about how the schema was constructed.
Thank you for any pointers on where to start digging in, and I’m looking forward to contributing to and learning from this community!