How would a non-cloud-based FOSS solution work in practice?


(Sean) #1

This is a continuation of GitHub #324 with the original responses reposted here. Folks are welcome to continue the discussion!

Original post by me:

I’m very engaged and interested in Common Voice and its potential, and I know it’s strictly focused on gathering the data needed to train a speech rec algorithm. But I have a question about how this will actually be used once the dataset is released.

First of all, I’m interested in offline, non-cloud-based solutions running on commodity laptops or even smartphones. Correct me if I’m wrong, but any such solution would not have to download a copy of the entire Common Voice library of data, right? About how big would the actual training (meta)data from a modern speech recognition library be, if we provided it Common Voice’s speech data as a training set?

I don’t need exact numbers, but are we talking about on the order of a couple megs, a couple gigs, or tens of gigs or more? Assume the library is designed for high fidelity recognition but uses all the state of the art machine learning tricks.

My second question: How do you think updates to the training would work? Could this be automated, so that, say, a speech rec library that consumes Common Voice could “re-train” every few months as Common Voice continues to build a bigger and bigger library? Would this re-training take place on a centralized server and affect the code of the library, or could each and every user of the library do it efficiently on their own system?

I ask because I recently wrote a TeamSpeak 3 plugin for deaf users that recognizes users’ spoken text and prints it in chat. I’m using the Google Cloud Speech API, which is extremely expensive for a hobbyist (between $5 and $50 per month per person, depending on how much you use it) because none of the “offline” speech recognition libraries I could find were even in the same realm of accuracy as Google’s solution.

Even with perfectly ordinary US and British English accents, high-quality studio microphones with a pop filter and the optimal amount of gain, with the voice data coming from TeamSpeak 3’s Opus codec with the quality maxed out, libraries such as CMUSphinx had an accuracy of about 10-15%. Under the same conditions, Google Cloud Speech API would have about an 80% to 98% accuracy, depending on how fast the user speaks.
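To be clear about what I mean by “accuracy”: I’m thinking of it as roughly 1 minus the word error rate (WER), i.e. the word-level edit distance between the reference transcript and the engine’s output, divided by the reference length. A minimal sketch of the metric:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[j] holds the edit distance between ref[:i] and hyp[:j]
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev_diag, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cur = d[j]
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[j] = min(cur + 1,            # deletion
                       d[j - 1] + 1,       # insertion
                       prev_diag + cost)   # substitution (or match)
            prev_diag = cur
    return d[-1] / len(ref)

print(word_error_rate("please type this out", "please type this out"))  # 0.0
print(word_error_rate("the quick brown fox", "the quack brown fox"))    # 0.25
```

So an engine that substitutes one word in four gets a WER of 0.25, or what I’d loosely call “75% accuracy.”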

My hope is that, either:

  • Existing FOSS speech rec libraries like CMUSphinx are able to incorporate Common Voice’s data once it’s released, and that this will be enough to make CMUSphinx at least competitive with Google Cloud Speech
  • Mozilla is planning to create a new FOSS speech rec library that uses more modern techniques to better make use of the dataset and generate even higher quality recognition once trained

If I can get rid of my dependency on the expensive cloud service, where I’m paying Google loads of money just to help a deaf user in my community, that would be a huge win. My TeamSpeak 3 plugin is already open source, but I can see plenty of initiatives where folks would write plugins for other voice chat programs (Skype, Ventrilo, Hangouts, etc.), and after a while, Common Voice could be the dataset behind just about every voice program offering accurate speech recognition to deaf users.

Now, wouldn’t that be an enormous win for the significant proportion of the population who are deaf?


Response by Omniscimus:

Huge thumbs up for this post! :smile: I’m just a GitHub contributor, but AFAIK:

any such solution would not have to download a copy of the entire Common Voice library of data

Right, it should be possible to train an AI using the dataset and then upload just the AI to a local machine and have it figure out new samples.

I really hope that the Common Voice dataset will be large enough and that its quality will be good. I’m interested in updates as well. The goal of the project seems to be to gather a large data set and be done with it. Wouldn’t it be better to keep improving the data set (quality and size) after the first set has been gathered? I think it’s very likely that the dataset will need to keep improving (even if the first result is great).

Maybe some projects that will use the dataset could give feedback and/or additional recordings back to Common Voice. I believe Google does this as well, e.g. anything you say to Google Now gets uploaded to their servers and kept to improve on their dataset. Obviously there are privacy concerns here, but it could be an opt-in.


Next reply by me (allquixotic):

Thanks for your reply! But it just made me think of another question…

I’m not 100% sure, but I think that lossy encoding applied to voice (instead of it being raw PCM straight from the microphone, e.g. WAV or FLAC) really throws off speech recognition algorithms.

This is why I believe Google initially required lossless formats for Speech API, and only opened it up to lossy upon repeated request. But the lossy format recognition is worse.

Would it be feasible for Common Voice to encode samples uploaded by users at random bitrates and with a variety of common codecs? At minimum you’d want Opus Voice, MP3, and Vorbis, and some data still in lossless format like FLAC. AAC could be important too.
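Sketching what I mean (a hypothetical helper, assuming ffmpeg is available server-side; `libopus`, `libmp3lame`, and `libvorbis` are ffmpeg’s standard encoder names, but everything else here is made up for illustration):

```python
import random

# Hypothetical augmentation table: codec name -> (ffmpeg encoder, bitrate range in kbps)
CODECS = {
    "opus":   ("libopus",    (8, 64)),
    "mp3":    ("libmp3lame", (32, 192)),
    "vorbis": ("libvorbis",  (45, 160)),
    "flac":   ("flac",       (0, 0)),   # keep some samples lossless
}

def transcode_command(src, dst, codec, bitrate_kbps):
    """Build the ffmpeg argv that re-encodes src with the given codec/bitrate."""
    encoder, _ = CODECS[codec]
    cmd = ["ffmpeg", "-y", "-i", src, "-c:a", encoder]
    if bitrate_kbps:                      # lossless codecs take no bitrate flag
        cmd += ["-b:a", f"{bitrate_kbps}k"]
    return cmd + [dst]

def random_variant(src, rng=random):
    """Pick a random codec and bitrate for one uploaded sample."""
    codec = rng.choice(sorted(CODECS))
    lo, hi = CODECS[codec][1]
    return transcode_command(src, f"{src}.{codec}", codec,
                             rng.randint(lo, hi) if hi else 0)
```

The idea being that each clean upload could yield several degraded variants, so the training set reflects the transcoding damage real-world audio actually carries.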

Oh, and then there’s the audio path used for monaural microphone capture in the Bluetooth Headset Profile (HSP), which carries voice over SCO links using the low-bitrate CVSD codec. That codec is really, really bad, and some Common Voice users are submitting their recordings already “pre-encoded” this way from the Bluetooth headset they’re using to read (analog audio waves -> microphone -> ADC -> PCM -> CVSD over SCO -> Bluetooth HSP protocol -> computer/phone/tablet -> web browser -> whatever encoding Common Voice uses -> Common Voice servers).

So there are clearly going to be cases where we have audio encoded lossily more than once. In an extreme example, with TeamSpeak 3, it would get encoded in SCO, then back to PCM, then to Opus, before my speech recognition plugin gets to it. Then I’m encoding it once more (I’m using FLAC because it’s accurate, but it incurs a heavy upstream cost under heavy use) and sending it to the Google Cloud Speech API right now. Triple ouch!

If this encoding and re-encoding poses problems for speech recognition engines, even the “good ones” based on TensorFlow and the like, then Common Voice needs to record several times more than the 10,000 hours they planned. Maybe 100,000 hours, to capture the diversity of waveforms that get mangled for “psychoacoustic optimization” (which sounds fine to humans) while the actual data looks almost random to a computer.


Next reply by rugk:

Note that, while it’s not a FOSS solution but rather a “proof of concept” here: remember that Windows 7 had speech recognition that worked completely offline and required no network connection. The “fact” that this has to be done online on some random server was largely invented afterwards by Google, Apple and finally Microsoft…
I think Common Voice is a good project to change this again and allow the creation of privacy-friendly speech-recognition services, which could then be used in IoT/smart-home devices, etc.


Last reply by mikehenrty:

Hi everyone, great discussion!

There’s a lot here to talk about, so thanks for bringing these issues up. I will try to answer a few questions, but then I’m going to ask if we can move this conversation over to Discourse, since it’ll be hard to pull anything actionable in the code from such an expansive discussion.

I don’t need exact numbers, but are we talking about on the order of a couple megs, a couple gigs, or tens of gigs or more?

Tens of gigs or more :slight_smile:

My second question: How do you think updates to the training would work?

This is actually out of scope of Common Voice. CV is only about collecting public domain data that can be used to train voice recognition algorithms. Note, Mozilla is also working on the DeepSpeech project, which is an open source speech-to-text engine. We hope to use the CV data to one day train DeepSpeech.

Now, wouldn’t that be an enormous win for the significant proportion of the population who are deaf?

Yes, absolutely! We would love to help you achieve this with the CV data!!


(Sean) #2

Overall it seems like the topics in the post include:

  • The size of the trained algorithm / neural network required to take advantage of the Common Voice dataset – @mikehenrty implied it’d be 10s of gigs or more (or misunderstood my question).

So every user using the CV dataset with something like DeepSpeech would have to download all this data to their computer?

This is one (significant) part of understanding the difficulty of implementing an “offline” speech recognition engine.

  • Lossy encoding in real-world systems could impact recognition accuracy

  • Privacy-friendly speech recognition that could be used “offline” (without phoning home to a server) is highly desired by everyone in the discussion so far.

  • mikehenrty pointed out that DeepSpeech is Mozilla’s eventual destination for the Common Voice data – I didn’t know this project existed :slight_smile: but now I do, so I’ll be watching that project as well!


(Omniscimus) #3

So this is outside of Common Voice’s scope, but I wonder how a file of basically sound waves can be converted to words. As a human, I have no more difficulty interpreting a wav file than an mp3 file, so what’s the difference for an algorithm between those formats? Any thoughts?
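My rough understanding of the difference: a recognizer doesn’t consume the waveform the way our ears do — it typically works from a time-frequency grid (a spectrogram) computed from the PCM samples, and lossy codecs deliberately throw away spectral detail that humans don’t miss but that perturbs exactly those feature values. A toy sketch, assuming 16 kHz PCM input (frame/hop sizes are just common choices):

```python
import numpy as np

def magnitude_spectrogram(samples, frame_len=400, hop=160):
    """Slice PCM samples into overlapping windowed frames and take the
    magnitude of each frame's FFT -- roughly what a recognizer consumes."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len] * np.hamming(frame_len)
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)

# One second of a 440 Hz tone at 16 kHz: the spectrogram shows energy
# concentrated in a single frequency bin, frame after frame.
sr = 16000
t = np.arange(sr) / sr
spec = magnitude_spectrogram(np.sin(2 * np.pi * 440 * t))
peak_bin = spec[0].argmax()
peak_hz = peak_bin * sr / 400  # bin width = sr / frame_len = 40 Hz
```

An MP3 and a WAV of the same utterance produce noticeably different grids of numbers here, even when they sound identical to us — which is, as I understand it, where the algorithm’s trouble starts.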


(Wim) #4

I’m sure it depends a lot on exactly what technique the recognizer is using, but pre-trained deep neural net models I’ve seen in other contexts have ranged from tens to low hundreds of megabytes. That’s a large download, but not impractical to store on your phone, I think. (It’s certainly much smaller than the training set, which from what I’ve read of this project, might amount to a terabyte or more of audio?)
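Those two sizes are easy to sanity-check with back-of-the-envelope arithmetic (assuming 16 kHz / 16-bit / mono PCM for the audio and float32 weights for the model — common choices, not anything Common Voice or DeepSpeech has actually specified):

```python
# Raw dataset: the stated 10,000-hour target as uncompressed PCM
bytes_per_second = 16_000 * 2                      # sample rate * bytes per sample
dataset_bytes = bytes_per_second * 3600 * 10_000
print(f"dataset: {dataset_bytes / 1e12:.2f} TB")   # ~1.15 TB

# Trained model: only the learned parameters need to ship to the user
params = 50_000_000                                # assumed count, order of magnitude
model_bytes = params * 4                           # 4 bytes per float32 weight
print(f"model: {model_bytes / 1e6:.0f} MB")        # 200 MB
```

So “a terabyte or more” of training audio versus a model in the tens-to-hundreds of megabytes both check out, give or take an order of magnitude.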


(Michael Henretty) #5

Thanks for your reply here @wiml!

I also want to highlight the fact that there are two possible “downloads” we are talking about here:

  • Raw Voice Data (large, GBs)
  • Speech Recognition models trained from above data (smaller, MBs)

Common Voice is only about the raw data for now. See project DeepSpeech for turning that data into usable models.