Hey Common Voice Community,
We would like to invite you to attend our Ask Me Anything (AMA) session on Dataset release with Jenny, Lead Engineer of Common Voice.
You can ask Jenny questions about the dataset release in this topic on 4th August, 3pm-4pm UTC.
Any questions we are unable to answer live will be followed up on at a later date. Please abide by the Community Participation Guidelines when proposing questions.
We look forward to answering your questions.
This topic is live. To see Jenny’s responses, please click on the arrow for the corresponding question.
Question 1: What exactly are the goals and objectives of the cooperation with NVIDIA? Are they simply supporting the project as it is, or will the project shift to a certain direction or a concrete plan?
Response 1 from Jenny
From my perspective there are two big goals to the partnership: 1) to increase Common Voice’s (CV) funding stability and diversity so that the team can invest in growth, and 2) to tighten the feedback loop between data collection and the machine learning teams that actually use the data. The CV team has always wanted more visibility into how to measure the quality/usability of the data that’s being collected, and those metrics can be so variable depending on the use case. When the Mozilla Corporation restructure happened last fall, NVIDIA was one of the orgs that reached out to the team about supporting the project, and it was definitely the best fit in terms of how the internal machine learning teams were already making use of CV.
Mozilla Foundation owns the overall vision and governance of the Common Voice project, and NVIDIA is only one of the partners that Mozilla Foundation is working with. While all stakeholders - including Gates, Giz, FCDO, NVIDIA - are going to have their own goals, the CV team makes sure any funds invested in the platform benefit all language communities and CV stays true to its core ethos of democratizing voice tech for all. And hey, NVIDIA wouldn’t have reached out to support the project if they didn’t think the project was already great - CV will always be open-source, and the CV dataset will always remain in the public domain. As for a concrete plan for what upcoming features and plans look like, watch out for the community roadmap sessions!
Response 2 from Jenny
The short answer is: keep doing what you’re doing! The amount of energy and enthusiasm we’ve been seeing on this project especially in the last six months has been tremendous, and I’ve been personally really excited and heartened to see this show of confidence in the future of the project. The slightly longer answer is that Common Voice is not so much one community as it is a hundred communities in a trench coat, and the CV team doesn’t (and can’t) have visibility into all of those communities. The more that you can all support one another by sharing tips around how to grow your own communities, for example by answering recurring questions in Github or Discourse and by sharing ideas and assets for running community events, the more Common Voice becomes self-sustaining. I know many of you are already doing that - thank you! We’re also working on expanding the ways the community can get involved in influencing the future of CV - Hillary has been planning some new and exciting ways of engaging (such as this AMA), so stay tuned for more via the weekly updates!
Question 3: Now that there are more languages with big datasets, are you planning to train more models, or will this stay something that the community has to do themselves? A central repository for available models also would be extremely useful.
Response 3 from Jenny
To be totally frank, model training is not something the Common Voice team has any expertise at, and building the internal capacity to train models doesn’t seem like the best use of our limited time and energy. I want Common Voice to be the catalyst for a super robust and accessible voice tech ecosystem, but that doesn’t mean we have to do every part of it ourselves. We’re good at building interfaces for collecting data, so that’s what we’re going to keep focusing on, but think of the dataset we build as the farm that grows the healthy ingredients for voice tools, while the community and stakeholders figure out the actual recipes and meals.
This is one of the benefits of the partnership with NVIDIA - the NeMo team, which builds NVIDIA’s open-source conversational AI toolkit, plans on training models with more CV languages and making those publicly available. Our work with partners like Gates and Giz on African languages including Kiswahili and Luganda also involves model creation, which again will be open-sourced. We’re also investigating how to make this and other model resources more accessible and easily discoverable to the community on the platform itself - we don’t want to just throw a list of resources into a Github repo!
Question 4: I’m wondering if there is any established workflow to deal with the sentences in reported.tsv. As an example, for Belarusian we have 5-6% problematic sentences in the Wikipedia export, and many of them have been reported by the speakers so far (although both precision and recall of reporting are not perfect, i.e. some reported sentences are OK, and some problematic sentences have never been reported)…
Response 4 from Jenny
There isn’t an established workflow, no; these sorts of sentence governance questions have been taken on by the community on a case-by-case basis. If the sentences are coming from the Wikipedia export, feel free to prepare a PR to remove them directly on the common-voice repo (though this will not impact any clips that have already been reported with those sentences). If the sentences are coming from the Sentence Collector, you’ll need to submit a PR to the sentence-collector repo instead. Feel free to also open an issue in the common-voice repo to discuss further, if you want input from other community members!
Question 5 (related to 4): Tangentially to the above, comments in reported.tsv for Belarusian, which were filled in by the contributors, are not displayed correctly: all Cyrillic characters have been replaced with question marks (probably an encoding issue at some stage of the data pipeline). Should we file an issue in the common-voice repo, or is it already on the radar?
Response 5 from Jenny
Nope, good catch! Can you actually file an issue in the common-voice-bundler repo, as that’s the tool we use for creating each dataset release? If this is impacting Belarusian I suspect it’s also impacting other languages and I’d like to take a closer look.
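For anyone writing up that issue, here is a minimal Python sketch of one common way this symptom can arise. It is an illustration only: the actual step in the pipeline that loses the characters has not been identified in this thread, and the helper name `lossy_roundtrip`, the `latin-1` charset, and the sample comment are all assumptions made for the example.

```python
# Illustration only: the real cause in the dataset pipeline is unknown.
# This shows how text passed through a charset that cannot represent it,
# with a "replace" error policy, ends up as question marks.

def lossy_roundtrip(text: str, charset: str = "latin-1") -> str:
    """Encode text to `charset`, replacing unencodable characters, then decode it back."""
    return text.encode(charset, errors="replace").decode(charset)

# Hypothetical reporter comment in Belarusian ("error in the sentence").
comment = "Памылка ў сказе"

# Cyrillic letters do not exist in latin-1, so each one becomes "?".
print(lossy_roundtrip(comment))

# UTF-8 can represent every character, so the text survives unchanged.
print(lossy_roundtrip(comment, "utf-8"))
```

If the comments in reported.tsv look like the first printed line, that would be consistent with a non-Unicode encoding somewhere between the database and the exported file, though other causes are possible.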
Added after event
Question 6: After downloading the Belarusian dataset, we found that the total duration of all recordings is larger than announced on the Common Voice website: 356 hours actually vs. 325 hours indicated in the website statistics as of 2021-07-29 (or even less on 2021-07-21 when the dataset was created). Is it true that, for statistical purposes, the total duration is calculated with certain limitations, e.g. dropping silence at the beginning/end of each clip, or dropping invalidated clips?
Response 6 from Jenny
This is actually because the stats on the website are not exact values, but rather estimates based on the average length of each clip for each language. As Belarusian hadn’t been part of a previous release, the estimate was based on the overall dataset average, which was around 4.7s. The average clip length for Belarusian in dataset 7 was actually around 5.4s, which is quite a big gap. These new averages have been added to the website and are now reflected on those graphs, and should match more closely what you’re seeing in the dataset!
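To make the effect of the average concrete, here is a rough Python sketch. It assumes the estimate is simply the number of clips multiplied by the average clip length; the exact formula the website uses is not given here, and both the function name `estimated_hours` and the clip count are hypothetical.

```python
# Sketch under an assumption: estimated duration = clip count * average clip length.
# The website's exact calculation is not described in this thread.

def estimated_hours(clip_count: int, avg_clip_seconds: float) -> float:
    """Estimate total recorded hours from a clip count and an average clip length."""
    return clip_count * avg_clip_seconds / 3600

clips = 100_000  # hypothetical clip count, not a real Common Voice figure

# Same clips, two different averages (values taken from the response above).
print(round(estimated_hours(clips, 4.7), 1))  # using the overall dataset average
print(round(estimated_hours(clips, 5.4), 1))  # using the Belarusian average
```

Under that assumption, the same set of clips yields a noticeably smaller figure with the 4.7s average than with the 5.4s one, which is the direction of the gap described in the question.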
Response 7 from Jenny
I took a quick look at the db and it looks like most of the votes those clips received were clustered pretty closely together, and most of them were from late last year or earlier this year. I don’t know this for certain, but my best guess is that this happens when a language is running low on unvalidated clips and the same set of clips gets served up to multiple validators at the same time, so the clips are still in cache even though someone else may have already voted on them. The one with 7 votes is especially baffling, because the clusters of people disagreed on whether that clip was valid, so it kept getting served up.