As you know, Common Voice is our initiative to build open and publicly available datasets of labelled audio, i.e. voice recordings connected to transcribed text, that anyone can use to train voice-enabled applications. Or phrased differently: to help teach machines to understand the human voice, regardless of language, accent, age or gender.
To do this we need to collect millions of small voice samples, recorded from tens of thousands of people. In fact, to realise what we call a “minimum viable dataset” – one that is of sufficient size and quality to train a reasonably performant Deep Speech speech-to-text model in any given language – we would need ~2,000 to 3,000 hours of transcribed audio from ~1,000 different speakers. With voice clips averaging 4 seconds each, even the lower bound of 2,000 hours means about 1.8 million recordings.
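The arithmetic behind that figure is easy to check. Here is a back-of-the-envelope sketch in Python; the function name `clips_needed` is made up for illustration, and the 4-second average is simply the figure quoted above.

```python
SECONDS_PER_HOUR = 3600


def clips_needed(hours_of_audio, avg_clip_seconds=4):
    """Number of clips needed to add up to the given hours of audio."""
    return hours_of_audio * SECONDS_PER_HOUR // avg_clip_seconds


# Lower and upper bound of the "minimum viable dataset" estimate
print(clips_needed(2000))
print(clips_needed(3000))
```

The lower bound gives the 1.8 million recordings mentioned above; the upper bound of 3,000 hours would need half as many again.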
Our main hub for this is the Common Voice website, where you can contribute your own voice by just reading a few example sentences. So far so good. But we know that more ideas for collecting voices can only help, whether they improve the website or lead to something completely new. That’s why we made sure to build in a set of publicly accessible APIs that let people like you test additional ways to get people engaged. I thought I’d try this out with the idea of building a Telegram bot.
The basics
For a start, we will only concentrate on the recording aspect. The API is fairly simple. You first need to collect the sentence through a simple GET. For this example, we’re assuming that you want to contribute to the English language:
GET https://voice.mozilla.org/api/v1/en/sentences/
[
  {
    "id": "597df77fba9610efed4d0bfa8bb75777f39fa7e46d61be3b23542ea26acf1b64",
    "text": "The house was built using concrete."
  }
]
By using an API call (instead of just reading out of a database) we ensure the sentences are recorded with the correct statistical frequency that DeepSpeech needs.
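If you want to try the GET step from Python 3, a minimal sketch could look like the following. It assumes the endpoint returns a JSON list shaped like the example above; the helper names (`sentences_url`, `parse_first_sentence`, `fetch_sentence`) are hypothetical, and the base URL points at the staging server rather than production.

```python
import json
import urllib.request

# Assumption: staging server, so experiments do not touch production data
BASE_URL = "https://voice.allizom.org/api/v1/"


def sentences_url(lang):
    """Build the sentences endpoint URL for a language code such as 'en'."""
    return BASE_URL + lang + "/sentences/"


def parse_first_sentence(body):
    """Return (id, text) of the first sentence in a JSON response body."""
    sentences = json.loads(body)
    first = sentences[0]
    return first["id"], first["text"]


def fetch_sentence(lang="en"):
    """Ask the API for a sentence to read (performs a network request)."""
    with urllib.request.urlopen(sentences_url(lang)) as response:
        return parse_first_sentence(response.read().decode("utf-8"))
```

Calling `fetch_sentence("en")` should give you the ID and text to show to the user. Keeping the parsing separate from the network call makes it easy to test with a canned response.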
Once the user records the sentence, we POST the audio file to the server, passing the text and ID parameters we received in our GET call.
The headers that need to be set are the following:
- Content-Type: application/octet-stream
- sentence: UTF-8 encoded sentence (the "text" we got earlier)
- sentence_id: the "id" we got earlier
- client_id: A unique identifier for the user. This is used when bucketing the data before training and testing the neural network. These buckets mustn’t overlap, to ensure that the network doesn’t overfit on training data. It could be anything as long as it is unique per user (and per application): for the deployed Telegram bot we appended a fixed string (telegram_v1) to Telegram’s generated unique user ID.
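The header set above can be sketched as a small Python 3 helper. The function name `build_clip_headers`, the `app_tag` default and the underscore separator in `client_id` are assumptions made for illustration; the percent-encoding of the sentence mirrors what the Python listing later in this post does.

```python
from urllib.parse import quote


def build_clip_headers(sentence_text, sentence_id, user_id, app_tag="myclient_v0.1"):
    """Build the HTTP headers for uploading one clip."""
    return {
        "Content-Type": "application/octet-stream",
        # Percent-encode the UTF-8 bytes so the header value stays plain ASCII
        "sentence": quote(sentence_text.encode("utf-8")),
        "sentence_id": sentence_id,
        # Assumption: any scheme works as long as it is unique per user and per app
        "client_id": app_tag + "_" + str(user_id),
    }
```

The important design point is the `client_id`: keep it stable for a given user so that all of their clips end up in the same bucket.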
At that point, you can issue a POST request to https://voice.mozilla.org/api/v1/en/clips, with the bytearray of your audio file in the body of the request.
A few notes:
- It doesn’t really matter what encoding is used for the audio file, as long as ffmpeg (which we use server-side to transcode the clips) can read it. That said, we suggest using OGG or MP3 if you can.
- While there’s no real authentication involved, we ask that you use voice.allizom.org (our staging server) instead of voice.mozilla.org when doing development, or run your own voice server. On staging, you can also check whether uploads were successful by uploading clips for empty languages.
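To make it harder to hit production by accident, you could wrap the upload request in a helper that defaults to staging. This is only a sketch in Python 3: `build_clip_request` and its `production` flag are hypothetical names, and the URLs are the ones mentioned above.

```python
import urllib.request

STAGING = "https://voice.allizom.org/api/v1/"
PRODUCTION = "https://voice.mozilla.org/api/v1/"


def build_clip_request(audio_bytes, headers, lang="en", production=False):
    """Prepare (but do not send) the POST request that uploads one clip."""
    base = PRODUCTION if production else STAGING
    return urllib.request.Request(
        base + lang + "/clips",
        data=audio_bytes,   # raw bytes of the recording go in the body
        headers=headers,    # Content-Type, sentence, sentence_id, client_id
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the clip. You have to opt in explicitly with `production=True` before anything reaches the live dataset.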
Python code
Documentation is great, but it’s always better to have some working code. Here’s a simple implementation of the above written in Python, taken from the Telegram bot. Feel free to copy it over and use it in your code.
# -*- coding: utf-8 -*-
# Python 2 code (it relies on urllib2)
import urllib2
import json

base_url = "https://voice.allizom.org/api/v1/"  # staging server
voice_lang = "en"
json_data = {}
user_id = 1234  # Generate a unique user ID (an integer, because it is formatted with %i below)

# Ask the API for a sentence to read
data = json.load(urllib2.urlopen(base_url + voice_lang + '/sentences/'))
json_data["sentence_id"] = data[0]["id"]
json_data["sentence_text"] = data[0]["text"]
print("🎤 -- " + data[0]["text"].encode('utf-8'))

audio_as_bytearray = ...  # do record! Placeholder: this must contain the byte array with the audio content

headers_dict = {
    'Content-Type': "application/octet-stream",
    'sentence': urllib2.quote(json_data["sentence_text"].encode("utf-8")),
    'sentence_id': json_data["sentence_id"],
    'client_id': 'myclient_v0.1' + "%i" % user_id
}

# Upload the recording; passing a body makes this a POST request
req = urllib2.Request(base_url + voice_lang + '/clips', audio_as_bytearray, headers=headers_dict)
res = urllib2.urlopen(req)
print(res.getcode())
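One thing to keep in mind about the last two lines: `urlopen` raises an exception for 4xx and 5xx responses instead of returning them, so the status code is only printed when the upload went through. A Python 3 sketch of a wrapper that reports the status either way might look like this; `upload_clip` and its `opener` parameter are hypothetical, the latter added only so the function can be exercised without a network connection.

```python
import urllib.error
import urllib.request


def upload_clip(request, opener=urllib.request.urlopen):
    """Send a prepared request and return the HTTP status code."""
    try:
        with opener(request) as response:
            return response.getcode()
    except urllib.error.HTTPError as err:
        # Raised for 4xx/5xx answers; the status code is still available
        return err.code
```

Connection problems (for example a DNS failure) surface as `urllib.error.URLError` and are deliberately left to propagate here, since there is no status code to report in that case.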
Try it out live
Using the above code, I built a minimal Telegram bot which allows you to contribute clips to the English language. You can use it by talking to http://t.me/voicemozillaorg_bot. Although the bot is minimal, it works, and it sends your clips to production, so please use it mindfully.
Hack on it
The code for the bot is available on Common Voice’s GitHub. It is primitive, but it should be a solid base to try hacking on Common Voice. Feel free to fork it, improve it (the interaction is still quite rough, and there is no support for user profiles or stats) and/or simply translate it into your language. Can you think of more features (e.g. reminders, or gamification inside the bot) which you would like to see? Are you in contact with a maker space and want to help create a new interactive physical experience? Pull requests are welcome, and so are all sorts of new experiments that make contributing voice clips more fun and engaging.
If you have questions, please reach out to the team.