Dataset Release AMA Thread (Active: 4th August 3-4pm UTC)

heyhillary · August 19, 2021, 4:04pm

Hey Common Voice Community,

We would like to invite you to attend our Ask Me Anything (AMA) session on Dataset release with Jenny, Lead Engineer of Common Voice.

You can ask Jenny questions about the dataset release via this topic on 4th August, 3pm-4pm UTC.

Any questions we are unable to answer live will be followed up on at a later date. Please abide by the Community Participation Guidelines when proposing questions.

We look forward to answering your questions.

This topic is live. To see Jenny’s responses, please click on the arrow for the corresponding question.

Question 1: What exactly are the goals and objectives of the cooperation with NVIDIA? Are they simply supporting the project as it is, or will the project shift to a certain direction or a concrete plan?

Response 1 from Jenny

From my perspective there are two big goals to the partnership: 1) to increase Common Voice’s (CV) funding stability and diversity so that the team can invest in growth, and 2) to tighten the feedback loop between data collection and the machine learning teams that actually use the data. The CV team has always wanted more visibility into how to measure the quality/usability of the data that’s being collected, and those metrics can be so variable depending on the use case. When the Mozilla Corporation restructure happened last fall, NVIDIA was one of the orgs that reached out to the team about supporting the project, and it was definitely the best fit in terms of how the internal machine learning teams were already making use of CV.

Mozilla Foundation owns the overall vision and governance of the Common Voice project, and NVIDIA is only one of the partners that Mozilla Foundation is working with. While all stakeholders - including Gates, Giz, FCDO, NVIDIA - are going to have their own goals, the CV team makes sure any funds invested in the platform benefit all language communities and CV stays true to its core ethos of democratizing voice tech for all. And hey, NVIDIA wouldn’t have reached out to support the project if they didn’t think the project was already great - CV will always be open-source, and the CV dataset will always remain in the public domain. As for a concrete plan for what upcoming features and plans look like, watch out for the community roadmap sessions!

Question 2: What can we/the community do to make your life easier?

Response 2 from Jenny

The short answer is: keep doing what you’re doing! The amount of energy and enthusiasm we’ve been seeing on this project especially in the last six months has been tremendous, and I’ve been personally really excited and heartened to see this show of confidence in the future of the project. The slightly longer answer is that Common Voice is not so much one community as it is a hundred communities in a trench coat, and the CV team doesn’t (and can’t) have visibility into all of those communities. The more that you can all support one another by sharing tips around how to grow your own communities, for example by answering recurring questions in Github or Discourse and by sharing ideas and assets for running community events, the more Common Voice becomes self-sustaining. I know many of you are already doing that - thank you! We’re also working on expanding the ways the community can get involved in influencing the future of CV - Hillary has been planning some new and exciting ways of engaging (such as this AMA), so stay tuned for more via the weekly updates!

Question 3: Now that there are more languages with big datasets, are you planning to train more models, or will this stay something that the community has to do themselves? A central repository for available models also would be extremely useful.

Response 3 from Jenny

To be totally frank, model training is not something the Common Voice team has any expertise at, and building the internal capacity to train models doesn’t seem like the best use of our limited time and energy. I want Common Voice to be the catalyst for a super robust and accessible voice tech ecosystem, but that doesn’t mean we have to do every part of it ourselves. We’re good at building interfaces for collecting data, so that’s what we’re going to keep focusing on, but think of the dataset we build as the farm that grows the healthy ingredients for voice tools, while the community and stakeholders figure out the actual recipes and meals.

This is one of the benefits of the partnership with NVIDIA - the NeMo team, which builds NVIDIA’s open-source conversational AI toolkit, plans on training models with more CV languages and making those publicly available. Our partnerships with organizations like Gates and Giz on African languages including Kiswahili and Luganda also involve model creation, which again will be open-sourced. We’re also investigating how to make this and other model resources more accessible and easily discoverable to the community on the platform itself - we don’t want to just throw a list of resources into a Github repo!

Question 4: I’m wondering if there is any established workflow to deal with the sentences in reported.tsv. As an example, for Belarusian we have 5-6% problematic sentences in the Wikipedia export, and many of them have been reported by the speakers so far (although both precision and recall of reporting are not perfect, i.e. some reported sentences are OK, and some problematic sentences have never been reported)…

Response 4 from Jenny

There isn’t an established workflow, no. These sorts of sentence governance questions have been taken on by the community on a case-by-case basis. If the sentences are coming from the Wikipedia export, feel free to prepare a PR to remove them directly on the common-voice repo (though this will not impact any clips that have already been recorded with those sentences). If the sentences are coming from the Sentence Collector, you’ll need to submit a PR to the sentence-collector repo instead. Feel free to also open an issue in the common-voice repo to discuss further, if you want input from other community members!
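
As a rough illustration of the kind of triage that could precede such a PR, the sketch below counts how often each sentence appears in reported.tsv and lists the ones reported at least a chosen number of times as candidates for manual review. The column names (`sentence`, `reason`), the reason labels, the threshold, and the sample rows are all assumptions for illustration, not a documented workflow, so check the header row of your own file first.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a real reported.tsv; the column
# names and reason labels are assumptions, not a documented schema.
SAMPLE_REPORTED_TSV = (
    "sentence\treason\n"
    "Sentence A\tgrammar-or-spelling\n"
    "Sentence A\tdifficult-pronounce\n"
    "Sentence B\tgrammar-or-spelling\n"
)


def removal_candidates(tsv_file, min_reports=2):
    """Return sentences reported at least `min_reports` times, sorted."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter(row["sentence"] for row in reader)
    return sorted(s for s, n in counts.items() if n >= min_reports)


candidates = removal_candidates(io.StringIO(SAMPLE_REPORTED_TSV))
print(candidates)
```

Because reporting is neither fully precise nor complete, the output is best treated as a review list rather than something to delete automatically.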

Question 5 (related to Question 4): Tangentially to the above, comments in reported.tsv for Belarusian, which were filled in by the contributors, are not displayed correctly: all Cyrillic characters have been replaced with question marks (probably an encoding issue at some stage of the data pipeline). Should we file an issue in the common-voice repo, or is it already on the radar?

Response 5 from Jenny

Nope, good catch! Can you actually file an issue in the common-voice-bundler repo, as that’s the tool we use for creating each dataset release? If this is impacting Belarusian I suspect it’s also impacting other languages and I’d like to take a closer look.
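
For anyone trying to reproduce the symptom before filing, here is a minimal sketch of one common way Cyrillic text turns into question marks: encoding to a character set that has no Cyrillic letters while substituting unencodable characters. This is only an illustration of the general failure mode; the sample text is made up and nothing here is known to be the actual cause in the bundler pipeline.

```python
comment = "Прывітанне"  # hypothetical Belarusian comment text

# Encoding to a charset without Cyrillic, with errors="replace",
# silently substitutes "?" for every character it cannot represent.
mangled = comment.encode("latin-1", errors="replace").decode("latin-1")

# A UTF-8 round trip preserves the original text.
intact = comment.encode("utf-8").decode("utf-8")

print(mangled)
print(intact)
```

The first line printed is a run of question marks, one per original character, and the original text cannot be recovered from it.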

Added after event

Question 6: After downloading the Belarusian dataset, we found that the total duration of all recordings is larger than announced on the Common Voice website: 356 hours actually vs. 325 hours indicated in the website statistics as of 2021-07-29 (or even less on 2021-07-21 when the dataset was created). Is it true that, for statistical purposes, the total duration is calculated with certain limitations, e.g. dropping silence at the beginning/end of each clip, or dropping invalidated clips?

Response 6 from Jenny

This is actually because the stats on the website are not exact values, but rather estimates based on the average length of each clip for each language. As Belarusian hadn’t been part of a previous release, the estimate was based on the overall dataset average, which was around 4.7s. The average clip length for Belarusian in dataset 7 was actually around 5.4s, which is quite a big gap. These new averages have been added to the website and are now reflected on those graphs, and should match more closely what you’re seeing in the dataset!
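
To make the effect of the averaging concrete, the sketch below estimates total hours as clip count times average clip length, once with the overall average (4.7s) and once with the Belarusian average (5.4s). The clip count and the function name are hypothetical and chosen only for illustration; the website's actual calculation may differ in detail.

```python
def estimated_hours(clip_count, avg_clip_seconds):
    """Estimate total duration in hours as clip count x average clip length."""
    return clip_count * avg_clip_seconds / 3600


clips = 200_000  # hypothetical clip count, for illustration only

with_overall_avg = estimated_hours(clips, 4.7)   # overall dataset average
with_language_avg = estimated_hours(clips, 5.4)  # Belarusian average in dataset 7

print(round(with_overall_avg, 1), round(with_language_avg, 1))
```

Whatever the clip count, switching from 4.7s to 5.4s raises the estimate by a factor of 5.4 / 4.7, roughly 15%.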

Question 7: Hi, a short question, at least in the Czech CSVs there are sentences with more votes than three (at least up to six), not many of those, but still: what can be the cause?

Response 7 from Jenny

I took a quick look at the db and it looks like most of the votes those clips received were clustered pretty closely together, and most of them were from late last year or earlier this year. I don’t know this for certain, but my best guess is that this happens when a language is running low on unvalidated clips and the same set of clips got served up to multiple validators at the same time, so they were still in cache even though someone else may have already voted on them. The one with 7 votes is especially baffling, because the clusters of people disagreed on whether that clip was valid, so it kept getting served up.
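
If you want to list such clips in your own language's files, a minimal sketch along these lines may help. It sums the up and down votes per row and keeps rows whose total exceeds three. The column names (`path`, `up_votes`, `down_votes`) and the sample rows are assumptions to check against your own files.

```python
import csv
import io

# Hypothetical rows standing in for a release TSV; the column names
# are assumptions to verify against your file's header row.
SAMPLE_TSV = (
    "path\tup_votes\tdown_votes\n"
    "clip_1.mp3\t2\t0\n"
    "clip_2.mp3\t3\t3\n"
    "clip_3.mp3\t2\t1\n"
)


def clips_with_many_votes(tsv_file, max_expected=3):
    """Return (path, total_votes) for rows with more than `max_expected` votes."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    result = []
    for row in reader:
        total = int(row["up_votes"]) + int(row["down_votes"])
        if total > max_expected:
            result.append((row["path"], total))
    return result


over_voted = clips_with_many_votes(io.StringIO(SAMPLE_TSV))
print(over_voted)
```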

heyhillary · July 14, 2021, 10:54am

heyhillary · August 3, 2021, 3:11pm

heyhillary · August 3, 2021, 3:24pm

stergro · August 4, 2021, 12:16pm

What exactly are the goals and objectives of the cooperation with NVIDIA? Are they simply supporting the project as it is, or will the project shift to a certain direction or a concrete plan?
In the past, there was a feature on CV that showed people what percentage of their contributions got accepted. I think it got removed for performance reasons. Any chance that this feature will return? It is very useful feedback for users, especially when you donate to a foreign language with an accent.
What can we/the community do to make your life easier?

I also have some DeepSpeech-related questions (not sure if you can answer them):

AFAIK the complete DeepSpeech team left Mozilla and founded the startup Coqui (https://coqui.ai). They plan useful things like updating the system to TensorFlow 2. Do you want to cooperate with them, or are there any plans to keep developing DeepSpeech (and Mozilla TTS) independently inside of Mozilla?
Now that there are more languages with big datasets, are you planning to train more models, or will this stay something that the community has to do themselves? A central repository for available models also would be extremely useful.

mytmpaccount2015 · August 4, 2021, 11:16am

Hi @heyhillary @phire, thank you for creating this thread. I have several questions related to the dataset release:

(1) After downloading the Belarusian dataset, we found that the total duration of all recordings is larger than announced on the Common Voice website: 356 hours actually vs. 325 hours indicated in the website statistics as of 2021-07-29 (or even less on 2021-07-21 when the dataset was created). Is it true that, for statistical purposes, the total duration is calculated with certain limitations, e.g. dropping silence at the beginning/end of each clip, or dropping invalidated clips?

(2) I’m wondering if there is any established workflow to deal with the sentences in reported.tsv. As an example, for Belarusian we have 5-6% problematic sentences in the Wikipedia export, and many of them have been reported by the speakers so far (although both precision and recall of reporting are not perfect, i.e. some reported sentences are OK, and some problematic sentences have never been reported). Could we e.g. prepare a PR, based on reported.tsv, to remove known problematic sentences from the site data, so that they no longer would be available for recording? Just wondering if this kind of manual patching is the right way to go, consistent with other proposed improvements, such as the automated workflow to run extraction from newly-created Wikipedia articles, outlined by @mkohler here.

(3) Tangentially to the above, comments in reported.tsv for Belarusian, which were filled in by the contributors, are not displayed correctly: all Cyrillic characters have been replaced with question marks (probably an encoding issue at some stage of the data pipeline). Should we file an issue in the common-voice repo, or is it already on the radar?

Thanks in advance for any comments.

comodoro · August 4, 2021, 2:52pm

Hi, a short question, at least in the Czech CSVs there are sentences with more votes than three (at least up to six), not many of those, but still: what can be the cause?

heyhillary · August 4, 2021, 3:04pm

Hey everyone!

Thanks for your questions so far.

The thread is now live. Jenny is currently reading your questions and will be responding shortly.

phire · August 4, 2021, 4:04pm

Thanks so much for your thoughtful questions folks, I really enjoyed answering them! I’ve got to head off, but we’ll follow up later this week with answers to all the other technical questions we didn’t get to. As always, feel free to ping us in the Github repo - you know where to find us.

stergro · August 4, 2021, 6:10pm

Thanks for your answers, Jenny! I learned quite a bit.

Next time, it may be better to give the people a little more time to ask questions. Many folks only visit this forum once a week or so.

heyhillary · August 5, 2021, 8:44am

Hey Stefan,

Thanks so much for your feedback. We definitely agree, and we hope to use this feedback to improve the experience for AMAs and community engagements.

heyhillary · August 19, 2021, 4:05pm

Hey everyone, Jenny has been responding to some of the questions that we weren’t able to answer during the session. I have updated the topic to include them.