Using Deep Speech

reuben · November 29, 2017, 9:25pm

Yeah, looks like it’s not the language model; rather, the acoustic model is struggling with the audio.

Could be due to noise in the recording, or maybe your accent. We definitely want to make our models more robust to things like that, by training with more varied data for example.

Nik · November 30, 2017, 6:15pm

Do you think there is a lot to gain by using the 250 hours of Common Voice and trying to do the whole training process myself? Or might it be better to wait until there are about 5000 hours of data, as was used in the paper by Baidu?

readwrite · November 30, 2017, 9:15pm

How can one do transfer learning using the pretrained DeepSpeech model?

yesterdays · December 1, 2017, 12:08am

The line from deepspeech.model import Model provides the following error:

Benjamin_Burkhart · December 3, 2017, 8:25pm

I was looking at potentially using Deep Speech to align subtitles within video files, but to do so I would need to know when in the audio stream the inference started (timings). I am a programmer, but it would help if someone familiar with the project could give me a hint on how to get that data out of the inference process. Any ideas?

Benjamin_Burkhart · December 3, 2017, 8:28pm

You can disregard this; I just saw the thread “Time Metadata”.

sawantilak · December 19, 2017, 11:20am

Hey, did you find a solution to this issue? I am facing the same issue.

mark2 · December 22, 2017, 9:35am

Hi!

I am testing the basic use of DeepSpeech with the pre-trained model downloaded from https://github.com/mozilla/DeepSpeech/releases and some test wav files downloaded from https://www.dropbox.com/s/xecprghgwbbuk3m/vctk-pc225.tar.gz?dl=1. The correct transcriptions for the three cases below are “It is linked to the row over proposed changes at Scottish Ballet”, “Please call Stella” and “Ask her to bring these things with her from the store” respectively. The results produced by the default model are totally different:

AMAC02TX3KKHTD8:DeepSpeech mark$ deepspeech …/models/output_graph.pb p225_366.wav …/models/alphabet.txt …/models/lm.binary …/models/trie
Loading model from file …/models/output_graph.pb
Loaded model in 1.071s.
Loading language model from files …/models/lm.binary …/models/trie
Loaded language model in 3.408s.
Running inference.
i do
Inference took 8.283s for 15.900s audio file.
AMAC02TX3KKHTD8:DeepSpeech mark$ deepspeech …/models/output_graph.pb p225_001.wav …/models/alphabet.txt …/models/lm.binary …/models/trie
Loading model from file …/models/output_graph.pb
Loaded model in 0.920s.
Loading language model from files …/models/lm.binary …/models/trie
Loaded language model in 3.111s.
Running inference.
huh
Inference took 4.822s for 6.155s audio file.
AMAC02TX3KKHTD8:DeepSpeech mark$ deepspeech …/models/output_graph.pb p225_002.wav …/models/alphabet.txt …/models/lm.binary …/models/trie
Loading model from file …/models/output_graph.pb
Loaded model in 1.026s.
Loading language model from files …/models/lm.binary …/models/trie
Loaded language model in 3.217s.
Running inference.
a cage
Inference took 7.021s for 12.176s audio file.

Any ideas what causes such behaviour?

BR,
Mark

kdavis · December 22, 2017, 9:46am

Are the wav audio files 16-bit, 16 kHz, and mono? If not, deepspeech can’t create transcripts for them.
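For anyone who wants to check this before running inference, here is a minimal sketch using Python's standard `wave` module. The function name `check_wav_format` is made up for illustration and is not part of deepspeech; it only inspects the WAV header for the three properties mentioned above.

```python
import wave


def check_wav_format(path):
    """Return a list of problems; an empty list means 16-bit, 16 kHz, mono.

    Hypothetical helper for illustration, not part of deepspeech.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        sample_width = wav.getsampwidth()  # bytes per sample
        sample_rate = wav.getframerate()   # samples per second
        channels = wav.getnchannels()
    if sample_width != 2:
        problems.append(f"sample width is {8 * sample_width}-bit, expected 16-bit")
    if sample_rate != 16000:
        problems.append(f"sample rate is {sample_rate} Hz, expected 16000 Hz")
    if channels != 1:
        problems.append(f"{channels} channels, expected 1 (mono)")
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_wav.py file1.wav file2.wav ...
    for wav_path in sys.argv[1:]:
        issues = check_wav_format(wav_path)
        print(wav_path, "OK" if not issues else "; ".join(issues))
```

A file recorded at 48 kHz would be reported with a sample-rate problem and would need to be resampled first.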

lissyx · December 22, 2017, 9:47am

@mark2 I just had a look at your files and, as mentioned, they are 48 kHz instead of the expected 16 kHz, which explains the completely unexpected output.

lissyx · December 22, 2017, 9:49am

FTR:

alex@portable-alex:~/tmp/deepspeech/cpu$ LC_ALL=C ./deepspeech ../models/output_graph.pb ../test-data/vctk-p225/wav48/p225/p225_366.wav ../models/alphabet.txt -t 2>&1 
2017-12-22 10:48:34.494758: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
u to o
cpu_time_overall=23.77523 cpu_time_mfcc=0.00953 cpu_time_infer=23.76570
alex@portable-alex:~/tmp/deepspeech/cpu$ LC_ALL=C ./deepspeech ../models/output_graph.pb ../test-data/vctk-p225/wav48/p225/p225_366.16k.wav ../models/alphabet.txt -t 2>&1 
2017-12-22 10:48:54.894628: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
it is lind to the row everyprprose changes at scosish balle
cpu_time_overall=22.01665 cpu_time_mfcc=0.00750 cpu_time_infer=22.00915

And one can do the conversion like this:

alex@portable-alex:~/tmp/deepspeech/cpu$ ffmpeg -i ../test-data/vctk-p225/wav48/p225/p225_366.wav -acodec pcm_s16le -ac 1 -ar 16000 ../test-data/vctk-p225/wav48/p225/p225_366.16k.wav
alex@portable-alex:~/tmp/deepspeech/cpu$
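If several files need converting, the same ffmpeg call can be scripted. The sketch below just wraps the command shown above with Python's `subprocess`; the function names are made up for illustration, and it assumes `ffmpeg` is installed and on the PATH.

```python
import subprocess


def ffmpeg_16k_mono_command(src, dst):
    """Build the ffmpeg argument list: 16-bit PCM, one channel, 16 kHz.

    Same options as the command line above; the helper name is hypothetical.
    """
    return [
        "ffmpeg",
        "-i", src,
        "-acodec", "pcm_s16le",  # 16-bit signed little-endian PCM
        "-ac", "1",              # mono
        "-ar", "16000",          # 16 kHz sample rate
        dst,
    ]


def convert_to_16k_mono(src, dst):
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(ffmpeg_16k_mono_command(src, dst), check=True)


if __name__ == "__main__":
    import sys

    # Usage: python convert.py input.wav output.16k.wav
    convert_to_16k_mono(sys.argv[1], sys.argv[2])
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.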

mark2 · December 22, 2017, 10:12am

Thanks! Now it gives more reasonable answers.

b.r · February 23, 2018, 5:48pm

Hello, what are the training data sets that went into the model that is available at https://github.com/mozilla/DeepSpeech/releases?

kdavis · February 26, 2018, 8:47am

LibriSpeech[1], Fisher[2,3,4,5], and Switchboard[6]

mirkobrankovic · April 25, 2018, 9:19pm

Hi,
I was wondering about the execution time:

Inference took 3.607s for 1.393s audio file.

Is this a normal execution time? I have seen some examples online where 30 s was needed for a 28 s audio file.
I also know of a few examples where the usual time is half the duration of the audio.
I haven’t tried GPU-powered deepspeech, since my hardware and OS are fighting with Nvidia at the moment.

Thanks,
mirko

kdavis · April 30, 2018, 6:07am

Unfortunately, whether this is “normal” depends entirely on the hardware.

mirkobrankovic · April 30, 2018, 7:41am

Thanks for the reply.
So if I prepare a more powerful GPU box, I should expect much better results.
The only reason I ask is that some proprietary software boasts of a 1/0.5 ratio of duration/transcription …

kdavis · April 30, 2018, 7:55am

A 1/0.5 ratio should be achievable on a GeForce GTX 1070 or above for clips a few seconds long.
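To compare a run against that figure, one can compute the ratio of inference time to audio duration from the log line the client prints. This is only a sketch: the function name and the regular expression are assumptions based on the "Inference took …" line format quoted earlier in this thread.

```python
import re

# Matches lines such as: "Inference took 3.607s for 1.393s audio file."
LOG_PATTERN = re.compile(r"Inference took ([\d.]+)s for ([\d.]+)s audio file")


def real_time_factor(log_line):
    """Return inference time divided by audio duration.

    Values below 1.0 mean faster than real time; a 1/0.5
    duration/transcription ratio corresponds to 0.5.
    """
    match = LOG_PATTERN.search(log_line)
    if match is None:
        raise ValueError("not an 'Inference took ...' log line")
    inference_s, audio_s = (float(group) for group in match.groups())
    return inference_s / audio_s


if __name__ == "__main__":
    line = "Inference took 3.607s for 1.393s audio file."
    print(f"{real_time_factor(line):.2f}")  # prints 2.59
```

For the run quoted above, the factor is about 2.59, so inference took roughly two and a half times as long as the clip itself.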

mirkobrankovic · April 30, 2018, 8:11am

Great, thanks for the info, kdavis.
Regards,
mirko

shriya485 · May 30, 2018, 11:30am

How can I use the pre-trained model?