As of today, Common Voice can collect voice data for a specific purpose or use case. We’re putting this ability to the test by kicking off data collection for a single-word target segment that will eventually support 1) spoken digit recognition, 2) yes and no detection and 3) Hey Firefox wake word testing.
To make this happen, the Common Voice platform will collect audio from contributors across various languages speaking the digits zero through nine, as well as the words yes, no, hey and Firefox. These 14 single-word sentences will be prioritized for each contributor when they either Speak or Listen at Common Voice. To ensure a wide range of data in each language, we’ll limit recording of these sentences to just once per person, per language. We also recognize that listening to people say such short terms repeatedly may get boring and mentally fatiguing. To avoid burnout and ensure quality of contribution when listening to clips, each person will receive a maximum of two sets (28 clips in total) of these succinct recordings.
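The limits described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Common Voice implementation: the `SegmentTracker` class, its method names, and the choice to count the listening cap per contributor (rather than per contributor and language) are all hypothetical.

```python
from collections import defaultdict

# The 14 single-word prompts named in the post (English shown here).
TARGET_WORDS = [
    "zero", "one", "two", "three", "four", "five", "six", "seven",
    "eight", "nine", "yes", "no", "hey", "Firefox",
]
# Two sets of 14 clips, so 28 clips in total.
MAX_LISTEN_CLIPS = 2 * len(TARGET_WORDS)


class SegmentTracker:
    """Hypothetical per-contributor bookkeeping for the target segment."""

    def __init__(self):
        # (contributor, language) -> words already recorded
        self.recorded = defaultdict(set)
        # contributor -> number of segment clips already reviewed
        # (assumption: the cap is counted per person, across languages)
        self.listened = defaultdict(int)

    def words_to_record(self, contributor, language):
        """Return the target words this person has not yet recorded in this language."""
        done = self.recorded[(contributor, language)]
        return [word for word in TARGET_WORDS if word not in done]

    def mark_recorded(self, contributor, language, word):
        """Remember a recording so the same word is not offered again."""
        self.recorded[(contributor, language)].add(word)

    def can_listen(self, contributor):
        """True while the person is still under the 28-clip listening cap."""
        return self.listened[contributor] < MAX_LISTEN_CLIPS

    def mark_listened(self, contributor):
        """Count one reviewed segment clip against the cap."""
        self.listened[contributor] += 1
```

In this sketch, a contributor who has recorded "yes" in English would still be offered "yes" in French, because the once-per-person limit is tracked separately for each language.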
Why a target segment?
This targeted data collection will immediately benefit two collaborations: the first with Mozilla Fellow Josh Meyer, and the second with our teammates creating Firefox Voice.
Part of Josh’s work is to discover how much data is needed to train a machine learning engine on a new voice recognition application in a new language. For this work, Josh is aiming to benchmark the accuracy of Mozilla’s open source voice recognition engine, Deep Speech, in multiple languages for a similar task. Josh and the Deep Speech team have identified that spoken digit recognition, as well as yes and no detection, are great candidates for this type of application testing. The only caveat is that they need data to run those tests and, to quote Josh himself: “That data doesn’t exist…yet.”
Similarly, our Mozilla colleagues in Emerging Technologies are testing and training wake word options for Firefox Voice. They reached out, curious if Common Voice communities could help generate voice data for Hey Firefox in multiple languages.
By adding your voice to this target segment, you’re contributing to the work Josh, Deep Speech, Firefox Voice and Common Voice are doing – not to mention the people who will download this target segment and build voice recognition applications in various languages.
What languages will this be collected in?
Starting today, this targeted data collection is available in 13 languages*. If it is available in your language(s), you’ll notice 1) a banner announcing it on the Common Voice website and 2) some added context on the sentence cards when contributing. It’s our goal to enable this collection in as many languages as possible. To do so, we must first obtain all of the translated or transliterated** words for each language, verified by a native speaker. Once ready, they will be merged in and made available for contribution at Common Voice. If you’re interested in helping translate these words in your language(s), visit this GitHub repository, where you can submit a pull request or an issue for review.
Why is this important and what’s next?
As the Common Voice project grows in dataset size, community, and reach, it has become increasingly important for the platform to be able to distinguish the context of its collected data. Providing context, or a vocabulary of what the data relates to via tagging, allows for a more complete picture beyond language, accent, sex and age. This level of detail will allow contributors (community members submitting recordings and sentences, as well as researchers and developers analyzing the final output) to select the segment that is most relevant to them. This will enable more detailed feedback on how to continue improving the dataset, while also unlocking more possibilities for usefulness and application.
To further this work, the team will be exploring the opportunities this new structure facilitates, including tagging at the clip level during the Listen phase. Our goal is to introduce more targeted data segments based on the content being recorded. Examples of this could be tagging background noise on a recorded clip or confirming that content is related to medicine or geography. Tagging both the sentences imported and the clips recorded makes the resulting data structure more comprehensive and accurate. Stay tuned for that release announcement and go add your voice to Common Voice’s first target segment!
Megan + the Common Voice team
*Initial 13 languages: Arabic, Catalan, English, German, Spanish, French, Dutch, Polish, Portuguese, Russian, Tamil, Turkish, Tatar.
**For the collection of Hey as part of Hey Firefox, the Firefox Voice team wishes to utilize a transliteration approach to capture more natural and comfortable utterances. An example in French is Hé Firefox rather than Salut Firefox.