All Hands > Community Update

The Common Voice staff at Mozilla had the great opportunity to sit down with some of the community during Mozilla’s All Hands in Whistler. During the meetings we were able to have in-depth conversations about product needs, metrics, and where we see innovations to the product happening in the next 6 months. Community feedback and contribution are integral to this project, and we want to hear feedback from those who were not in the room with us.

Reporting Tool

For those of you who contribute to Common Voice regularly, you may have noticed some errors in sentences. We discussed putting up a reporting tool in Common Voice and we have successfully launched it! You can see a “Report” button in the speak and validate contribution sections. This will allow people to alert us if there are grammar errors, words from a different language or other inaccuracies.

Community Metrics

One of the hot topics that came up was the community’s need to better see and understand what its dataset looks like. This means not only how many people are contributing and how many clips are recorded, but also answering questions such as: What should we be focusing on in contributions? How many sentences do we have left to record? How many sentences are skipped? These are just a few of the many data points community members would like to be aware of. We are working to roll out an MVP of community data that we can build on to help continuously answer questions about language velocity and contribution.

Partner Experiments

Common Voice is actively looking to partner with organizations that are also looking to enhance their voice data collection. While the team is still deciding what these partnerships look like, we are working to engage employees in donation as well as have companies outside of Mozilla push outreach. The goal of this would be to increase velocity in multiple languages. This is going to be one of our main focuses in the second half of the year.

Show Impact of the Data

The Community would like to see what is possible with the data that we are collecting. This could be implementations in products or the ability to see a model in action using only the Common Voice dataset. This would make contribution feel more meaningful and give the community something tangible to work toward. We are currently evaluating the scope of this and understanding if this is something we can test quickly.

Build Best Practices for Working with the Community

The Common Voice team hears and understands that we need to be better about interacting with and including each part of the community. Over the past six months we have strived to build happier and healthier community engagement, and this will continue. We will work to expose our decisions for input more quickly, as well as give direction on which parts of Common Voice need the most help. This topic is just one example of the kind of more regular and transparent communication we want to provide.

Wikipedia Data Extraction

A blocker for many languages getting online is the lack of available CC0 sentences. With the help of the community, we are now able to pull Wikipedia sentences in a way that allows them to be classified as CC0 content. This tool is still in progress and will be released to the community once it is ready.

We look forward to your continued feedback!

-The Common Voice Team

These are all good ideas. One that I would add to the list is error reduction - i.e. focusing on reducing the number of clips that end up failing validation.

I feel like this requires a combination of both educational and technical features. Users would benefit from increased guidance on how to record and validate. While there are community guidelines, there need to be official Mozilla-sanctioned guidelines that are easily locatable on the main Common Voice site.

On the technical side of things, a lot of recording problems would be solved if users played their own recordings back before submitting. Users apparently don’t know about core features like playing back recordings and skipping sentences, so there is clearly some kind of UX issue around feature discovery.

Additionally, more feedback in the UI about technical issues like microphone volume would help to combat the large number of low volume recordings.

Hopefully we will get full clarity on rejected recordings once we have the data points and metrics enabled and accessible. If that turns out to be a big problem, we can prioritize its analysis and implement more granular solutions.

Stats will certainly provide metrics on how to improve, but my point was that at the scale that Common Voice needs to operate, even low error rates can add up to big numbers. For example, an error rate of 5% over 10,000 hours would be 500 hours. English seems to validate around 1-1.5 hrs of content per day, so this error rate would add an extra year or more to the project’s goal.
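The arithmetic in that example can be checked with a short sketch. The function name `extra_days` is hypothetical, and the 5% error rate, the 10,000 hour target, and the 1-1.5 hours validated per day are the illustrative figures from the post above, not measured values.

```python
def extra_days(total_hours, error_rate, validated_hours_per_day):
    """Days needed to replace the hours lost to rejected clips.

    All inputs are illustrative assumptions, not measured project data.
    """
    lost_hours = total_hours * error_rate
    return lost_hours / validated_hours_per_day


# Figures quoted in the post: 5% of 10,000 hours is 500 hours lost.
slow_case = extra_days(10_000, 0.05, 1.0)  # at 1 hour validated per day
fast_case = extra_days(10_000, 0.05, 1.5)  # at 1.5 hours validated per day

print(f"Extra time: {fast_case:.0f} to {slow_case:.0f} days")
```

At those rates the lost hours take somewhere between about 333 and 500 days to replace, which is where the "extra year or more" estimate comes from.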

So if, say, adding a recording volume meter resulted in 1% fewer rejections, that could add up to a big amount in the long term. That’s why I think it should be a goal irrespective of the actual current rate of failure (although, for the record, I suspect this is much much higher than 5%).

Yes, that’s what I think would be interesting to fully confirm with data, so we can understand what it means for the volume we want to achieve. Fully agree on considering volume and the future when analyzing the problems :+1:

Hi,

Collecting 2000 hours of voice clips is a long path. It’s really difficult to keep a community active or growing for years. So, showing the impact of the collected data is a must.

Questions are often about guidelines for listening. So, I suggest adding a help button in the speak and listen windows, pointing to the community language guidelines for speaking and listening.

About collecting sentences: Wikipedia data extraction is a great improvement, but it lacks spoken-style sentences. I wonder if similar criteria (taking 1-3 sentences per item) could be applied to other sources (opensubtitles.org, TED Talks captions, epub files, …).
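As a rough illustration of the "1-3 sentences per item" idea, here is a minimal sketch. The function name `sample_sentences`, the naive punctuation-based sentence split, and the random selection are all assumptions made for this example; this is not the actual Wikipedia extraction tool, and it says nothing about whether a given source can legally be used.

```python
import random
import re


def sample_sentences(item_text, max_per_item=3, seed=None):
    """Pick at most `max_per_item` sentences from one source item.

    Hypothetical helper: splits naively on '.', '!' or '?' followed by
    whitespace, then samples without replacement.
    """
    parts = re.split(r"(?<=[.!?])\s+", item_text)
    sentences = [s.strip() for s in parts if s.strip()]
    rng = random.Random(seed)
    count = min(max_per_item, len(sentences))
    return rng.sample(sentences, count)


# Example with a made-up caption; at most three sentences are kept.
caption = "Where are you going? I am going home. It is late! See you soon."
print(sample_sentences(caption, max_per_item=3, seed=0))
```

Capping the number of sentences taken from each item keeps any single subtitle file, talk, or book from being reproduced in bulk.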

As soon as we have the Wikipedia script/process fully automated, we want to start a conversation again with Mozilla’s legal team to explore and understand this too.

I looked into OpenSubtitles once and couldn’t find any information on their site about the copyright status of the subtitles if the underlying movie is in the public domain.

@dabinat This would be the perfect dataset for getting new sentences; the number of sentences is huge.

Hi,
It is difficult to determine the category of a rejection without listening to the clip again.

I started making slides to talk about the mistakes I noticed. Here is a first list if it helps:

  • Added words
  • Omitted words
  • Words the speaker stumbles over before recovering
  • A letter omitted from a word (which changes the meaning of the sentence)
  • Recording stopped before the end of the sentence
  • Swapped syllables
  • Text that is not in the expected language
  • Audio problems caused by equipment (too much noise or a wind-tunnel effect)