Can the letters of the alphabet be added to Common Voice?

I’m attempting to use DeepSpeech to create a “voice keyboard” for typing with just your voice. The speech recognition converts your spoken words into key presses on your keyboard (using robotjs).

Here’s a screenshot of some of my test code to give you an idea:

[voice keyboard screenshot]

The numbers and symbols all work fairly well, but some letters are effectively impossible to dictate. The Common Voice DeepSpeech model doesn’t contain words that are even close to some letters, so I’ve been trying to come up with a set of substitutions, e.g. if I say “tea” it converts to “t”, and so on.
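The substitution step, roughly sketched below. The word→letter pairs shown are illustrative examples, not a complete or tested set, and the robotjs call is commented out since it needs a desktop environment:

```javascript
// Illustrative substitution table: spoken word -> key to press.
// These pairs are examples only, not a verified mapping.
const substitutions = {
  tea: "t",
  bee: "b",
  sea: "c",
  why: "y",
  are: "r",
};

// Convert a recognized word into the key it should press,
// falling back to the word itself (e.g. for "comma" or "one").
function wordToKey(word) {
  return substitutions[word.toLowerCase()] || word;
}

// With robotjs, this would then become an actual key press:
// const robot = require("robotjs");
// robot.keyTap(wordToKey(transcript));
```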

But this doesn’t work for many of the letters — D, E, F, G, H, Q, M, S, X, etc. These letters don’t have English words that sound close to how they are pronounced, so at best you’ll get something like “the” when you say “D”, but never “dee” or “de”. For some letters it can sometimes get it right if I say “letter S” or “letter O”, but for a lot of them it seems damned near impossible.

I’m contemplating trying to build a DeepSpeech model to handle just letters, but I’d need tons of audio of lots of people saying these letters in different orders, at different speeds, in different accents, etc.

I was wondering if it would be feasible to get letters included in the main DeepSpeech Common Voice model. Maybe map them to some “fake” words like:

X -> ecks
F -> eff
S -> ess
Q -> queue / cue — (these words seem to never get recognized when I say them, and I speak English fluently)

As long as every letter has a real word or “fake” word that gets recognized reasonably well, this will work.

You might be able to achieve this with just a language model containing those letters - no need to retrain the acoustic model.
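For illustration, the text such a letter-focused language model is trained on could be as simple as lines of space-separated letters, so the model assigns probability to letter sequences. This is only a sketch of generating that corpus; the actual build steps (KenLM, scorer packaging) are covered in the DeepSpeech documentation:

```javascript
// Generate sample training text for a letter-only language model:
// each line is a short sequence of space-separated letters.
const letters = "abcdefghijklmnopqrstuvwxyz".split("");

function randomLetterLine(length) {
  const words = [];
  for (let i = 0; i < length; i++) {
    words.push(letters[Math.floor(Math.random() * letters.length)]);
  }
  return words.join(" ");
}

// Build 1000 lines of 2-7 letters each; in practice you would
// write these to a text file and feed it to the LM toolchain.
const lines = [];
for (let i = 0; i < 1000; i++) {
  lines.push(randomLetterLine(2 + Math.floor(Math.random() * 6)));
}
// require("fs").writeFileSync("letters.txt", lines.join("\n"));
```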

I’m moving this conversation to #deep-speech since I feel the answer is more about the capabilities of DeepSpeech to identify letters.

If there are new requirements from DeepSpeech, we can always adapt how we collect the dataset via Common Voice later.


Creating a custom model is certainly an option for the task I’m trying to accomplish — I just haven’t dug into how to do it yet. But I feel like letters are something DeepSpeech should handle better than it does at the moment.

There are all kinds of scenarios I can think of where you want to speak words and letters intermixed. If I make a custom model with a custom vocabulary, then I can’t do things like say “iPhone 10S”. If the Common Voice English model had letters built in, it would/should recognize that as “I phone ten S”.
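Even with letters in the model, the app would still need to normalize transcripts like “I phone ten S” back into typed text. A rough sketch of that post-processing, where the word lists and regex rules are illustrative assumptions:

```javascript
// Illustrative number-word table; a real one would cover more words.
const numberWords = { one: "1", eight: "8", ten: "10" };

// Normalize a mixed transcript like "I phone ten S" into "iphone 10s".
function normalize(transcript) {
  return transcript
    .toLowerCase()
    .split(" ")
    .map((w) => numberWords[w] || w)   // "ten" -> "10"
    .join(" ")
    .replace(/\bi phone\b/, "iphone")  // rejoin split brand name
    .replace(/(\d+) ([a-z])\b/, "$1$2"); // "10 s" -> "10s"
}
```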

Saying addresses and postal codes, spelling out your name, entering data into spreadsheets, selecting items on a grid… all these potential uses of DeepSpeech aren’t really feasible at the moment.

I encountered this problem quite a lot while trying to make a chess game, where I had to parse every possible piece/letter/number combination like “knight to h 8” or “pawn to e 1”. It was a bit of a letdown how hard it is for DeepSpeech to correctly distinguish between “H”, “eight”, “E”, and “A”. Even with perfect pronunciation it can’t get the letters right, or often it doesn’t return any recognition results at all.

dabinat suggested creating a custom language model, not an entire acoustic model. You can take the recipe we use for generating the default release model and extend it with data containing phonetic spellings of letters. It’s not a huge effort requiring GPUs; you should be able to try it out and see how it does.

Okay, that’s great. I wasn’t aware a language model could be made that extends the English one. I’ll look into the docs on how to do this.