Using Deep Speech

(Nik) #8

output without the language model:

olteritof me quicker in fraens can be perforemed asing i supported and veny a gpi on lenices se belo fight which ipis are spoied tet is tom by instar and soim igi butsiv package heth ti comand pep install deep speech hian gi pu

(Reuben Morais) #9

Yeah, looks like it’s not the language model, but rather the acoustic model is struggling with the audio :confused:

Could be due to noise in the recording, or maybe your accent. We definitely want to make our models more robust to things like that, by training with more varied data for example.

(Nik) #10

Do you think there is a lot to gain by using the 250 hours of Common Voice and doing the whole training process myself? Or might it be better to wait until there are about 5000 hours of data, as was used in the paper by Baidu?


How can one do transfer learning using the pretrained DeepSpeech model?

(Yesterdays) #12

The line from deepspeech.model import Model produces the following error:

(Benjamin Burkhart) #13

I was looking at potentially using Deep Speech to align subtitles within video files, but to do so I would need to know when in the audio stream the inference started (timings). I am a programmer, but it would help if someone familiar with the project could give me a hint on how to get that data out of the inference process. Any ideas?

(Benjamin Burkhart) #14

You can disregard this; I just saw the thread “Time Metadata”.


(Sawantilak) #15

Hey, did you find a solution to this issue? I am facing the same issue.

(Matti Meikäläinen) #16


I am testing the basic use of DeepSpeech with the pre-trained model downloaded from and some test WAV files downloaded from The correct transcriptions for the three cases below are “It is linked to the row over proposed changes at Scottish Ballet”, “Please call Stella” and “Ask her to bring these things with her from the store” respectively. The results produced by the default model are totally different:

AMAC02TX3KKHTD8:DeepSpeech mark$ deepspeech …/models/output_graph.pb p225_366.wav …/models/alphabet.txt …/models/lm.binary …/models/trie
Loading model from file …/models/output_graph.pb
Loaded model in 1.071s.
Loading language model from files …/models/lm.binary …/models/trie
Loaded language model in 3.408s.
Running inference.
i do
Inference took 8.283s for 15.900s audio file.
AMAC02TX3KKHTD8:DeepSpeech mark$ deepspeech …/models/output_graph.pb p225_001.wav …/models/alphabet.txt …/models/lm.binary …/models/trie
Loading model from file …/models/output_graph.pb
Loaded model in 0.920s.
Loading language model from files …/models/lm.binary …/models/trie
Loaded language model in 3.111s.
Running inference.
Inference took 4.822s for 6.155s audio file.
AMAC02TX3KKHTD8:DeepSpeech mark$ deepspeech …/models/output_graph.pb p225_002.wav …/models/alphabet.txt …/models/lm.binary …/models/trie
Loading model from file …/models/output_graph.pb
Loaded model in 1.026s.
Loading language model from files …/models/lm.binary …/models/trie
Loaded language model in 3.217s.
Running inference.
a cage
Inference took 7.021s for 12.176s audio file.

Any ideas what could cause such behaviour?


(kdavis) #17

Are the wav audio files 16-bit, 16 kHz, and mono? If not, deepspeech can’t create transcripts for them.
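
A quick way to check those three properties before running inference is Python's standard wave module. This is only a sketch: the function name check_wav is made up for illustration and is not part of DeepSpeech.

```python
import wave


def check_wav(path):
    """Return a list of format problems; an empty list means 16-bit, 16 kHz, mono."""
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample
        rate = wav.getframerate()
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if sample_width != 2:
        problems.append(f"expected 16-bit samples, found {8 * sample_width}-bit")
    if rate != 16000:
        problems.append(f"expected 16000 Hz, found {rate} Hz")
    return problems
```

Calling check_wav on a 48 kHz recording would report the sample-rate mismatch, which is the kind of file that needs converting first.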

(Lissyx) #18

@mark2 I just had a look at your files and, as mentioned, they are 48 kHz instead of the expected 16 kHz, which explains the completely unexpected output.

(Lissyx) #19


alex@portable-alex:~/tmp/deepspeech/cpu$ LC_ALL=C ./deepspeech ../models/output_graph.pb ../test-data/vctk-p225/wav48/p225/p225_366.wav ../models/alphabet.txt -t 2>&1 
2017-12-22 10:48:34.494758: I tensorflow/core/platform/] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
u to o
cpu_time_overall=23.77523 cpu_time_mfcc=0.00953 cpu_time_infer=23.76570
alex@portable-alex:~/tmp/deepspeech/cpu$ LC_ALL=C ./deepspeech ../models/output_graph.pb ../test-data/vctk-p225/wav48/p225/p225_366.16k.wav ../models/alphabet.txt -t 2>&1 
2017-12-22 10:48:54.894628: I tensorflow/core/platform/] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
it is lind to the row everyprprose changes at scosish balle
cpu_time_overall=22.01665 cpu_time_mfcc=0.00750 cpu_time_infer=22.00915

The conversion can be done like this:

alex@portable-alex:~/tmp/deepspeech/cpu$ ffmpeg -i ../test-data/vctk-p225/wav48/p225/p225_366.wav -acodec pcm_s16le -ac 1 -ar 16000 ../test-data/vctk-p225/wav48/p225/p225_366.16k.wav
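
For converting several files, the same ffmpeg invocation can be wrapped in a small Python helper. This is a sketch that assumes ffmpeg is on the PATH; the function names are hypothetical, and the flags are exactly the ones from the command above.

```python
import subprocess


def ffmpeg_16k_mono_cmd(src, dst):
    """Build the ffmpeg argument list for 16-bit PCM, mono, 16 kHz output."""
    return [
        "ffmpeg",
        "-i", src,
        "-acodec", "pcm_s16le",  # 16-bit little-endian PCM
        "-ac", "1",              # one channel (mono)
        "-ar", "16000",          # 16 kHz sample rate
        dst,
    ]


def convert(src, dst):
    """Run ffmpeg and raise CalledProcessError if the conversion fails."""
    subprocess.run(ffmpeg_16k_mono_cmd(src, dst), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.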

(Matti Meikäläinen) #20

Thanks! Now it gives more reasonable answers.

(Buvana R) #21

Hello, what are the training data sets that went into the model that is available at

Will you release a fully trained NN?

(kdavis) #22

LibriSpeech[1], Fisher[2,3,4,5], and Switchboard[6]

(Mirko) #23

I was wondering about execution time:

Inference took 3.607s for 1.393s audio file.

Is this a normal execution time? I have seen some examples online where 30 s was needed for a 28 s audio file.
I also know of a few examples where the usual time is half the duration of the audio.
I haven’t tried GPU-powered deepspeech, since my hardware and OS are fighting with Nvidia at the moment.


(kdavis) #24

Unfortunately, whether this is “normal” depends entirely on the hardware.

(Mirko) #25

Thanks for the reply :slightly_smiling_face:
So if I prepare a more powerful GPU box, I should expect much better results.
The only reason I ask is that some proprietary software boasts a 1/0.5 ratio of duration/transcription …

(kdavis) #26

A 1/0.5 ratio should be achievable on a GeForce GTX 1070 or above for clips a few seconds long.
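
To compare such ratios across machines, the timing line printed by the command-line client can be turned into a single number (inference time divided by audio duration, so 0.5 corresponds to the 1/0.5 ratio above). A minimal sketch, assuming the "Inference took …" line format shown earlier in this thread; the function name is made up for illustration.

```python
import re

# Matches lines such as: "Inference took 3.607s for 1.393s audio file."
TIMING_RE = re.compile(r"Inference took ([0-9.]+)s for ([0-9.]+)s audio file\.")


def real_time_factor(line):
    """Return inference_seconds / audio_seconds, or None if the line does not match."""
    match = TIMING_RE.search(line)
    if match is None:
        return None
    inference_s = float(match.group(1))
    audio_s = float(match.group(2))
    return inference_s / audio_s
```

For the line quoted in #23 this gives roughly 2.6 (slower than real time), while the 15.900 s file from #16 comes out at about 0.52.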

(Mirko) #27

Great, thanks for the info, kdavis :slight_smile: