Help create Common Voice's first target segment

See this post in other languages: Español Français


As of today, Common Voice can collect voice data for a specific purpose or use case. We’re putting this ability to the test by kicking off data collection for a single-word target segment that will eventually support 1) spoken digit recognition, 2) yes and no detection and 3) Hey Firefox wake word testing.

To make this happen, the Common Voice platform will collect audio from contributors across various languages speaking the digits zero through nine, as well as the words yes, no, hey and Firefox. These 14 single-word sentences will be prioritized for each contributor when they either Speak or Listen at Common Voice. To ensure a wide range of data in each language, we’ll limit recording of these sentences to just once per person, per language. We also recognize that listening to people say such short terms repeatedly may get boring and mentally fatiguing. To avoid burnout and ensure quality of contribution when listening to clips, each person will receive a maximum of two sets (28 clips in total) of these short recordings.
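The collection rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `TARGET_WORDS`, `MAX_LISTEN_CLIPS`, `words_left_to_record` and `can_serve_target_clip` are hypothetical and do not come from the Common Voice codebase.

```python
# Hypothetical sketch of the rules described above; not the platform's actual code.
DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]
TARGET_WORDS = DIGITS + ["yes", "no", "hey", "Firefox"]  # 14 single-word sentences

# Listeners receive at most two sets of the 14 words, i.e. 28 clips.
MAX_LISTEN_CLIPS = 2 * len(TARGET_WORDS)


def words_left_to_record(already_recorded):
    """Return the target words a contributor has not yet recorded in one language."""
    done = set(already_recorded)
    return [word for word in TARGET_WORDS if word not in done]


def can_serve_target_clip(clips_heard):
    """True while a listener is still under the cap on target-segment clips."""
    return clips_heard < MAX_LISTEN_CLIPS
```

For example, a contributor who has already recorded yes and zero in a given language would have 12 words left to record in that language.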

Why a target segment?

This targeted data collection will immediately benefit two collaborations: the first being with Mozilla Fellow, Josh Meyer, and the second being our teammates creating Firefox Voice.

Part of Josh’s work is to discover how much data is needed to train a machine learning engine on a new voice recognition application in a new language. For this work, Josh is aiming to benchmark the accuracy of Mozilla’s open source voice recognition engine, Deep Speech, in multiple languages for a similar task. Josh and the Deep Speech team have identified that spoken digit recognition, as well as yes and no detection, are great candidates for this type of application testing. The only caveat is that they need data to run those tests and, to quote Josh himself: “That data doesn’t exist…yet.”

Similarly, our Mozilla colleagues in Emerging Technologies are testing and training wake word options for Firefox Voice. They reached out, curious if Common Voice communities could help generate voice data for Hey Firefox in multiple languages.

By adding your voice to this target segment, you’re contributing to the work Josh, Deep Speech, Firefox Voice and Common Voice are doing – not to mention the people who will download this target segment and build voice recognition applications in various languages.

What languages will this be collected in?

Starting today, this targeted data collection is available in 13 languages*. If it is available in your language(s), you’ll notice 1) a banner announcing it on the Common Voice website and 2) some added context on the sentence cards when contributing. It’s our goal to enable this collection in as many languages as possible. To do so, we must first obtain all of the translated or transliterated** words for each language, verified by a native speaker. Once ready, they will be merged in and made available for contribution at Common Voice. If you’re interested in helping contribute to the translation of these words in your language(s), visit this GitHub repository, where you can submit a pull request or an issue for review.

Why is this important and what’s next?

As the Common Voice project grows in dataset size, community, and reach, it has become increasingly important for the platform to be able to distinguish the context of its collected data. Providing context, or a vocabulary of what the data relates to via tagging, allows for a more complete picture beyond language, accent, sex and age. This level of detail will allow contributors — both community members submitting recordings and sentences, as well as researchers and developers analyzing the final output — to select the segment that is the most relevant to them. This will enable more detailed feedback on how to continue improving the dataset, while also unlocking more possibilities for usefulness and application.

To further this work, the team will be exploring opportunities this new structure facilitates, including tagging at the clip level during the Listen phase. Our goal is to introduce more targeted data segments based on the content being recorded. Examples of this could be tagging background noise on a recorded clip or confirming that content is related to medicine or geography. By implementing tagging, both for imported sentences and for recorded clips, the resulting data structure becomes more comprehensive and accurate. Stay tuned for that release announcement and go add your voice to Common Voice’s first target segment!

Cheers,

Megan + the Common Voice team


*Initial 13 languages: Arabic, Catalan, English, German, Spanish, French, Dutch, Polish, Portuguese, Russian, Tamil, Turkish, Tatar.

**For the collection of Hey as part of Hey Firefox, the Firefox Voice team wishes to utilize a transliteration approach to capture more natural and comfortable utterances. An example in French is Hé Firefox rather than Salut Firefox.

Hi, nice project,

just two questions about this part:

  1. I can’t find a column for the translation of Hey in the repository. How can we add translations?

  2. Right now in the German version of CV the word “Hei” appears, a word that doesn’t exist. Hei can be pronounced both like Hi and like Hey. I think just saying Hey Firefox would be the best and most natural way to use it in German. If you really want the literal sound, I would go with Hej.

Just added a pull request for the Danish numbers.

Hey all – I’m a Principal Scientist at Mozilla, and the person responsible for bringing Hey Firefox into the mix here. First, thanks everyone for your enthusiasm: it’s great to see.

I want to address two questions that have shown up. The first is around the transliteration vs. translation issue – that is to say, why are we collecting Hey Firefox instead of, say, Salut Firefox or Hola Firefox. That’s pretty easy: as you might expect, we’re interested in a wakeword that anyone can use, as part of the Firefox Voice project. For the system we are planning on using, we’re looking for something on the order of 2-4k recordings of Hey Firefox, from perhaps 1k people. That makes recording individualized, language-specific wakewords not viable at this stage. So that means we’re looking to show people something where they will read it out loud and say something that sounds like “Hey Firefox” which we can train from.

Some asked about the particular transliteration in German. For each of the languages we collected I checked with a native speaker about how they’d write it down. I could absolutely see differences of opinion here, and I’m entirely happy to discuss any particular transliteration. We’ll look into the right option for German; thank you for bringing it up.

Thanks again

Jofish

Hi Jofish,
thanks for the explanation. The thing with Hei is that ei is a tricky combination in German. You want us to pronounce e and i separately, but normally these two letters are pronounced like ay. You can try out the different pronunciations here on Google Translate by clicking on the listen icon.

I would just go with Hey, it is a common word in German, and you can find it in the Duden.

EDIT: maybe we just wait a little and see how people actually pronounce it. I would pronounce it like Hi if I didn’t know better.

EDIT: Just had a first woman pronouncing it like “Hi”. I will hit “no” if this happens.

EDIT: I used private tabs to see a little more of this project, and only one out of seven contributors pronounced it like Hey; everyone else said Hi.

I like data-driven decision making. If we’re getting only 1/7 pronouncing it like we expected, then I entirely agree; let’s change it to Hey.

Data tagging is a really great feature that could improve the dataset a lot. I had some questions though:

  1. Will this data be mixed into all the other clips in a dataset release or will target segments be considered separate datasets?

  2. To clarify, this is just to test the feature and there isn’t a specific intended use-case for gathering numbers?

Thanks @dabinat, great questions.

We’re aiming to have target segments downloadable as separate dataset segments. So this Single Word target segment would be made available separately from what we refer to as the General Corpus. The intent is for these release cycles to be linked. Meaning, we’re planning to release the Single Word segment data at the same time as the General Corpus data, with a release target of mid-year. (Note that we’ve never done a release under the bandwidth constraints imposed by the current pandemic situation, and timelines are not concrete.)

Currently spoken digit collection is intended for benchmark testing an application on digit recognition. No other specific use-cases are outlined at the moment.

Hello 🙂

The pull request with the native-speaker verification for Indonesian has just been merged, and I am wondering how long it will take for it to show up on the Common Voice site?

We usually do portal releases every 2 weeks. We have one planned for tomorrow, Tuesday the 26th, but I don’t know if the new languages will be merged into the main repo by today.

@Joshua_Meyer how often do you merge the new languages into the voice-web repo? Can you make a PR today, before tomorrow’s release?

Hi, what should we do if we have multiple different words corresponding to yes/no? Should we list them all in the PR?

For example, Japanese currently lists two variants for each word. How will this be handled in Common Voice?

(From another native speaker)
“Hei” is phonetically wrong if you assume Standard German, and lexically awkward.

If the spoken output you want is the word “Hey”, you should write it this way.

“Häi” or “Hej” would be (more) phonetically correct, but they look awkward and unfamiliar to German speakers, and everybody is familiar with anglicisms.
Additionally, “Hey” (in this exact form) is used frequently in everyday speech and writing (e.g. chats), making “Hei” look more like a spelling mistake.

To clarify: “Hei” looks like a bad translation, and is alien to written Standard German.

@irvin We’re working with @Joshua_Meyer to understand if multiple words corresponding to one word is an acceptable way forward for this benchmark test set. At the moment, we’ve only merged languages that have a 1-to-1 translation and it’s my understanding that Josh is validating the best corresponding word in each language. If you have multiples, for now it’s okay to list them all in the PR here and work with Josh to determine the best way forward for that language set.

@tucan welcome and thanks for that input. Sounds like you and @stergro are aligned re: the German preference for “Hey” over “Hei”. Appreciate that feedback. We’ll be sure the Firefox Voice team gets this info and work with @jofish to determine the best way forward prior to our next release (June 9th). Update: An issue has been submitted to the Common Voice GitHub to change Hei to Hey for German collection. This will go out with the June 9th release.

Also @stergro it’s great that you’re taking on validating more via a private tab. For recording, it’s important that we have as many unique voices as possible, so please avoid recording these target segment words more than once.

@mbranson Is there a particular goal or minimum in terms of number of contributors or hours recorded?

Do the progress bars on the Languages page count target segment contributions as part of the total? They shouldn’t if it’s a separate dataset.

Just a thought. I say digits a little differently when I am speaking them carefully one at a time, compared to reeling them off in a long multi-digit number.

Are there other languages seeing people reading the footnote?

Thanks for the questions and input @dabinat!

  • For single digits, yes and no we’re aiming to collect 4k validated utterances (clips) from at minimum 350 unique speakers

  • For hey and Firefox we’re aiming to collect 2-4k validated utterances (clips) from at minimum 350 unique speakers
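As a rough illustration of how these targets could be checked per language, here is a minimal Python sketch. The function name `segment_goal_met` and its parameters are hypothetical, and treating 2,000 clips as the threshold for hey and Firefox is an assumption that takes the lower end of the stated 2-4k range.

```python
# Hypothetical helper; thresholds are taken from the targets listed above.
def segment_goal_met(validated_clips, unique_speakers,
                     min_clips=4000, min_speakers=350):
    """Return True once both the clip target and the speaker target are reached."""
    return validated_clips >= min_clips and unique_speakers >= min_speakers


# Digits, yes and no: 4k validated clips from at least 350 unique speakers.
digits_done = segment_goal_met(validated_clips=4200, unique_speakers=400)

# hey and Firefox: assumes the lower bound (2k) of the 2-4k range as the threshold.
wake_word_done = segment_goal_met(validated_clips=2500, unique_speakers=300,
                                  min_clips=2000)
```

In this example the first check passes, while the second does not, because only 300 unique speakers have contributed.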

Yes, they do. This Single Word target segment is a part of our Common Voice Dataset and not a separate dataset itself. Therefore it adds to the overall collection numbers for the Common Voice Dataset.

Part of this work is determining how we represent progress toward segments as part of the dataset whole. Breaking this down by language is indeed another factor. Our current priority is collecting these clips, releasing the data and gathering insights before we make any sweeping changes to how progress is conveyed.