Using Deep Speech

(Kdavis) #1

Covers topics concerned with the use of Deep Speech

(Mark Boas) #2

Just wanted to say this is great to see. Since I’m working in the area of STT, I’m very much looking forward to the discussion on this topic and hoping to contribute :slight_smile:

(Paulius Šukys) #3

Is there a user guide for using pre-trained models?

(Kdavis) #4

In the coming days we will release an American English model and info on its use.

(Vincent Foucault) #5

Thanks a lot for DeepSpeech.
It really improves STT accuracy (considerably better than CMUSphinx!)

(Nik) #6

I’ve been testing the model which was released a few days ago. I recorded myself saying a few lines which are found in the readme.

The expected result:

Alternatively, quicker inference can be performed using a supported NVIDIA GPU on Linux. (See below to find which GPU’s are supported.) This is done by instead installing the GPU specific package with the command:
pip install deepspeech-gpu

actual result can be seen in the image below

Unfortunately I can’t upload the .wav file here; if necessary, I can upload it somewhere else.

Is this the expected performance of deep speech? I’m hypothesising that the language model used is not trained on the vocabulary I’m using. Is there anything to gain by looking at another language model?

(Reuben Morais) #7

To test if the language model is negatively influencing the results, simply omit the last two parameters (lm.binary and trie) and see if it improves.
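A minimal sketch of that comparison: build the argument list for the `deepspeech` command once with and once without the last two parameters. The positional order (model, audio, alphabet, then lm.binary and trie) is taken from the commands posted in this thread. The helper names `deepspeech_args` and `transcribe` are hypothetical, and the sketch assumes the `deepspeech` client is on the PATH.

```python
import subprocess


def deepspeech_args(model, audio, alphabet, lm=None, trie=None):
    """Build the argv for the deepspeech client (hypothetical helper)."""
    args = ["deepspeech", model, audio, alphabet]
    # Append the language model files only when both are supplied.
    if lm is not None and trie is not None:
        args += [lm, trie]
    return args


def transcribe(args):
    """Run the client and return what it printed on stdout."""
    result = subprocess.run(
        args, stdout=subprocess.PIPE, check=True, universal_newlines=True
    )
    return result.stdout.strip()
```

Calling `transcribe` on both argument lists for the same recording shows whether the language model changes the output.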

(Nik) #8

output without the language model:

olteritof me quicker in fraens can be perforemed asing i supported and veny a gpi on lenices se belo fight which ipis are spoied tet is tom by instar and soim igi butsiv package heth ti comand pep install deep speech hian gi pu
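To put a number on how far an output like this is from the expected text, one option is the word error rate (WER): the word-level edit distance divided by the number of reference words. The sketch below is a plain dynamic-programming implementation and not code from the DeepSpeech project. The function name `word_error_rate` is made up for this example, and lowercasing and splitting on whitespace is a simplifying assumption.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    # Guard against an empty reference to avoid dividing by zero.
    return prev[-1] / max(len(ref), 1)
```

A WER of 0.0 means the transcript matches the reference word for word. Values can exceed 1.0 when the hypothesis contains many extra words.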

(Reuben Morais) #9

Yeah, looks like it’s not the language model, but rather the acoustic model is struggling with the audio :confused:

Could be due to noise in the recording, or maybe your accent. We definitely want to make our models more robust to things like that, by training with more varied data for example.

(Nik) #10

Do you think there is a lot to gain by using the 250 hours of Common Voice and trying to do the whole training process myself? Or might it be better to wait until there are about 5000 hours of data, the amount used in the paper by Baidu?


How can one do transfer learning using the pretrained DeepSpeech model?

(Yesterdays) #12

The line from deepspeech.model import Model provides the following error:

(Benjamin Burkhart) #13

I was looking at potentially using Deep Speech to align subtitles within video files, but to do so I would need to know when in the audio stream the inference started (timings). I am a programmer, but it would help if someone familiar with the project could give me a hint on how to get that data out of the inference process. Any ideas?

(Benjamin Burkhart) #14

You can disregard this; I just saw the “Time Metadata” thread.


(Sawantilak) #15

Hey, did you find the solution to this issue? I am facing the same issue.

(Matti Meikäläinen) #16


I am testing the basic use of DeepSpeech with the pre-trained model downloaded from and some test wav-files downloaded from The correct transcriptions for the three cases below are “It is linked to the row over proposed changes at Scottish Ballet”, “Please call Stella” and “Ask her to bring these things with her from the store” respectively. The results suggested by the default model are totally different:

AMAC02TX3KKHTD8:DeepSpeech mark$ deepspeech …/models/output_graph.pb p225_366.wav …/models/alphabet.txt …/models/lm.binary …/models/trie
Loading model from file …/models/output_graph.pb
Loaded model in 1.071s.
Loading language model from files …/models/lm.binary …/models/trie
Loaded language model in 3.408s.
Running inference.
i do
Inference took 8.283s for 15.900s audio file.
AMAC02TX3KKHTD8:DeepSpeech mark$ deepspeech …/models/output_graph.pb p225_001.wav …/models/alphabet.txt …/models/lm.binary …/models/trie
Loading model from file …/models/output_graph.pb
Loaded model in 0.920s.
Loading language model from files …/models/lm.binary …/models/trie
Loaded language model in 3.111s.
Running inference.
Inference took 4.822s for 6.155s audio file.
AMAC02TX3KKHTD8:DeepSpeech mark$ deepspeech …/models/output_graph.pb p225_002.wav …/models/alphabet.txt …/models/lm.binary …/models/trie
Loading model from file …/models/output_graph.pb
Loaded model in 1.026s.
Loading language model from files …/models/lm.binary …/models/trie
Loaded language model in 3.217s.
Running inference.
a cage
Inference took 7.021s for 12.176s audio file.

Any ideas what causes such behaviour?


(Kdavis) #17

Are the wav audio files 16-bit, 16 kHz, and mono? If not, deepspeech can’t create transcripts for them.
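A quick way to check those three properties is Python's standard `wave` module. This is only a sketch: the function name `wav_format_problems` is made up for this example, and it only handles uncompressed PCM WAV files that `wave` can open.

```python
import wave


def wav_format_problems(path):
    """Return a list of reasons the file is not 16-bit, 16 kHz, mono."""
    problems = []
    with wave.open(path, "rb") as wav:
        sample_width = wav.getsampwidth()  # bytes per sample
        frame_rate = wav.getframerate()    # samples per second
        channels = wav.getnchannels()
    if sample_width != 2:
        problems.append("%d-bit samples, expected 16-bit" % (sample_width * 8))
    if frame_rate != 16000:
        problems.append("%d Hz sample rate, expected 16000 Hz" % frame_rate)
    if channels != 1:
        problems.append("%d channels, expected mono" % channels)
    return problems
```

An empty list means the file matches the expected format; anything else lists what needs converting before running inference.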

(Lissyx) #18

@mark2 I just had a look at your files and, as mentioned, they are 48 kHz instead of the expected 16 kHz, which explains the completely unexpected output.

(Lissyx) #19


alex@portable-alex:~/tmp/deepspeech/cpu$ LC_ALL=C ./deepspeech ../models/output_graph.pb ../test-data/vctk-p225/wav48/p225/p225_366.wav ../models/alphabet.txt -t 2>&1 
2017-12-22 10:48:34.494758: I tensorflow/core/platform/] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
u to o
cpu_time_overall=23.77523 cpu_time_mfcc=0.00953 cpu_time_infer=23.76570
alex@portable-alex:~/tmp/deepspeech/cpu$ LC_ALL=C ./deepspeech ../models/output_graph.pb ../test-data/vctk-p225/wav48/p225/p225_366.16k.wav ../models/alphabet.txt -t 2>&1 
2017-12-22 10:48:54.894628: I tensorflow/core/platform/] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
it is lind to the row everyprprose changes at scosish balle
cpu_time_overall=22.01665 cpu_time_mfcc=0.00750 cpu_time_infer=22.00915

The conversion can be done like this:

alex@portable-alex:~/tmp/deepspeech/cpu$ ffmpeg -i ../test-data/vctk-p225/wav48/p225/p225_366.wav -acodec pcm_s16le -ac 1 -ar 16000 ../test-data/vctk-p225/wav48/p225/p225_366.16k.wav
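To convert many files, the same ffmpeg call can be wrapped in a small script. The sketch below reuses exactly the flags from the command above (`-acodec pcm_s16le -ac 1 -ar 16000`). The function names are hypothetical, and it assumes `ffmpeg` is installed and on the PATH.

```python
import subprocess


def ffmpeg_16k_mono_args(src, dst):
    """Build the ffmpeg argv that writes dst as 16-bit PCM, mono, 16 kHz."""
    return [
        "ffmpeg",
        "-i", src,
        "-acodec", "pcm_s16le",  # 16-bit little-endian PCM
        "-ac", "1",              # one channel (mono)
        "-ar", "16000",          # 16 kHz sample rate
        dst,
    ]


def convert_to_16k_mono(src, dst):
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(ffmpeg_16k_mono_args(src, dst), check=True)
```

For example, `convert_to_16k_mono("p225_366.wav", "p225_366.16k.wav")` mirrors the command shown above.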

(Matti Meikäläinen) #20

Thanks! Now it gives more reasonable answers.