Using Deep Speech

kdavis · November 1, 2017, 2:28pm

Covers topics concerned with the use of Deep Speech

maboa · November 14, 2017, 11:31am

Just wanted to say this is great to see. Since I’m working in the area of STT, I’m very much looking forward to the discussion on this topic and hoping to contribute.

psukys · November 17, 2017, 4:48pm

Is there a user guide for using pre-trained models?

kdavis · November 21, 2017, 7:21am

In the coming days we will release an American English model and info on its use.

elpimous_robot · November 21, 2017, 3:44pm

Thanks a lot for DeepSpeech.
It really improves the STT accuracy (considerably better than CMUSphinx!)

Nik · November 29, 2017, 9:02pm

I’ve been testing the model which was released a few days ago. I recorded myself saying a few lines which are found in the readme.

The expected result:

Alternatively, quicker inference can be performed using a supported NVIDIA GPU on Linux. (See below to find which GPU’s are supported.) This is done by instead installing the GPU specific package with the command:
pip install deepspeech-gpu

actual result can be seen in the image below

Unfortunately I can’t upload the .wav file here; if necessary, I can upload it somewhere else.

Is this the expected performance of deep speech? I’m hypothesising that the language model used is not trained on the vocabulary I’m using. Is there anything to gain by looking at another language model?

reuben · November 29, 2017, 9:11pm

To test if the language model is negatively influencing the results, simply omit the last two parameters (lm.binary and trie) and see if it improves.
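One rough way to judge whether a run "improves" is to compute the word error rate (WER) of each transcript against the text that was actually read aloud. The following is a minimal sketch, assuming plain lowercase whitespace tokenisation; the function name `word_error_rate` is hypothetical and not part of DeepSpeech.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Run it once on the transcript produced with lm.binary and trie and once on the transcript produced without them; the lower value is the better result. Note that WER can exceed 1.0 when the hypothesis contains many extra words.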

Nik · November 29, 2017, 9:16pm

output without the language model:

olteritof me quicker in fraens can be perforemed asing i supported and veny a gpi on lenices se belo fight which ipis are spoied tet is tom by instar and soim igi butsiv package heth ti comand pep install deep speech hian gi pu

reuben · November 29, 2017, 9:25pm

Yeah, looks like it’s not the language model, but rather the acoustic model that is struggling with the audio.

Could be due to noise in the recording, or maybe your accent. We definitely want to make our models more robust to things like that, by training with more varied data for example.

Nik · November 30, 2017, 6:15pm

Do you think there is a lot to gain by using the 250 hours of Common Voice and trying to do the whole training process myself? Or might it be better to wait until there are about 5000 hours of data, as was used in the paper by Baidu?

readwrite · November 30, 2017, 9:15pm

How can one do transfer learning using the pretrained DeepSpeech model?

yesterdays · December 1, 2017, 12:08am

The line from deepspeech.model import Model provides the following error:

sawantilak · December 19, 2017, 11:20am

Hey, did you find the solution to this issue? I am facing the same issue.

mark2 · December 22, 2017, 9:35am

Hi!

I am testing the basic use of DeepSpeech with the pre-trained model downloaded from https://github.com/mozilla/DeepSpeech/releases and some test wav files downloaded from https://www.dropbox.com/s/xecprghgwbbuk3m/vctk-pc225.tar.gz?dl=1. The correct transcriptions for the three cases below are “It is linked to the row over proposed changes at Scottish Ballet”, “Please call Stella” and “Ask her to bring these things with her from the store” respectively. The results suggested by the default model are totally different:

AMAC02TX3KKHTD8:DeepSpeech mark$ deepspeech …/models/output_graph.pb p225_366.wav …/models/alphabet.txt …/models/lm.binary …/models/trie
Loading model from file …/models/output_graph.pb
Loaded model in 1.071s.
Loading language model from files …/models/lm.binary …/models/trie
Loaded language model in 3.408s.
Running inference.
i do
Inference took 8.283s for 15.900s audio file.
AMAC02TX3KKHTD8:DeepSpeech mark$ deepspeech …/models/output_graph.pb p225_001.wav …/models/alphabet.txt …/models/lm.binary …/models/trie
Loading model from file …/models/output_graph.pb
Loaded model in 0.920s.
Loading language model from files …/models/lm.binary …/models/trie
Loaded language model in 3.111s.
Running inference.
huh
Inference took 4.822s for 6.155s audio file.
AMAC02TX3KKHTD8:DeepSpeech mark$ deepspeech …/models/output_graph.pb p225_002.wav …/models/alphabet.txt …/models/lm.binary …/models/trie
Loading model from file …/models/output_graph.pb
Loaded model in 1.026s.
Loading language model from files …/models/lm.binary …/models/trie
Loaded language model in 3.217s.
Running inference.
a cage
Inference took 7.021s for 12.176s audio file.

Any ideas what could cause such behaviour?

BR,
Mark

kdavis · December 22, 2017, 9:46am

Are the wav audio files 16-bit, 16 kHz, and mono? If not, deepspeech can’t create transcripts for them.
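Those three properties can be checked before running inference. Here is a minimal sketch using Python's standard `wave` module; the helper name `check_wav_format` is hypothetical and not part of DeepSpeech.

```python
import wave


def check_wav_format(path):
    """Return a list of problems; an empty list means 16-bit, 16 kHz, mono."""
    problems = []
    # wave.open raises wave.Error for files it cannot parse as PCM WAV.
    with wave.open(path, "rb") as wav:
        bits = wav.getsampwidth() * 8  # sample width is reported in bytes
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if bits != 16:
        problems.append(f"sample width is {bits}-bit, expected 16-bit")
    if rate != 16000:
        problems.append(f"sample rate is {rate} Hz, expected 16000 Hz")
    if channels != 1:
        problems.append(f"{channels} channels, expected 1 (mono)")
    return problems
```

A file that returns a non-empty list should be converted before it is passed to deepspeech.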

lissyx · December 22, 2017, 9:47am

@mark2 I just had a look at your files and, as mentioned, they are 48 kHz instead of the expected 16 kHz, which explains the completely unexpected output.

lissyx · December 22, 2017, 9:49am

FTR:

alex@portable-alex:~/tmp/deepspeech/cpu$ LC_ALL=C ./deepspeech ../models/output_graph.pb ../test-data/vctk-p225/wav48/p225/p225_366.wav ../models/alphabet.txt -t 2>&1 
2017-12-22 10:48:34.494758: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
u to o
cpu_time_overall=23.77523 cpu_time_mfcc=0.00953 cpu_time_infer=23.76570
alex@portable-alex:~/tmp/deepspeech/cpu$ LC_ALL=C ./deepspeech ../models/output_graph.pb ../test-data/vctk-p225/wav48/p225/p225_366.16k.wav ../models/alphabet.txt -t 2>&1 
2017-12-22 10:48:54.894628: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
it is lind to the row everyprprose changes at scosish balle
cpu_time_overall=22.01665 cpu_time_mfcc=0.00750 cpu_time_infer=22.00915

And one can do the conversion like this:

alex@portable-alex:~/tmp/deepspeech/cpu$ ffmpeg -i ../test-data/vctk-p225/wav48/p225/p225_366.wav -acodec pcm_s16le -ac 1 -ar 16000 ../test-data/vctk-p225/wav48/p225/p225_366.16k.wav
alex@portable-alex:~/tmp/deepspeech/cpu$
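
For more than a handful of files, the same ffmpeg call can be scripted. This is a minimal sketch, assuming ffmpeg is installed and on the PATH; the helper names `ffmpeg_16k_mono_cmd` and `convert_to_16k_mono` are hypothetical, and the flags are exactly the ones from the command above.

```python
import subprocess


def ffmpeg_16k_mono_cmd(src, dst):
    """Build the ffmpeg argument list for 16-bit PCM, mono, 16 kHz output."""
    return [
        "ffmpeg",
        "-i", src,               # input file
        "-acodec", "pcm_s16le",  # 16-bit signed little-endian PCM
        "-ac", "1",              # one channel (mono)
        "-ar", "16000",          # 16 kHz sample rate
        dst,
    ]


def convert_to_16k_mono(src, dst):
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    subprocess.run(ffmpeg_16k_mono_cmd(src, dst), check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.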

mark2 · December 22, 2017, 10:12am

Thanks! Now it gives more reasonable answers.

b.r · February 23, 2018, 5:48pm

Hello, what are the training data sets that went into the model that is available at https://github.com/mozilla/DeepSpeech/releases?

kdavis · February 23, 2018, 5:55pm

LibriSpeech[1], Fisher[2,3,4,5], and Switchboard[6]
