Using DeepSpeech

Hi!

I am testing basic use of DeepSpeech with the pre-trained model downloaded from https://github.com/mozilla/DeepSpeech/releases and some test wav files downloaded from https://www.dropbox.com/s/xecprghgwbbuk3m/vctk-pc225.tar.gz?dl=1. The correct transcriptions for the three cases below are “It is linked to the row over proposed changes at Scottish Ballet”, “Please call Stella” and “Ask her to bring these things with her from the store” respectively. The results produced by the default model are something totally different:

AMAC02TX3KKHTD8:DeepSpeech mark$ deepspeech …/models/output_graph.pb p225_366.wav …/models/alphabet.txt …/models/lm.binary …/models/trie
Loading model from file …/models/output_graph.pb
Loaded model in 1.071s.
Loading language model from files …/models/lm.binary …/models/trie
Loaded language model in 3.408s.
Running inference.
i do
Inference took 8.283s for 15.900s audio file.
AMAC02TX3KKHTD8:DeepSpeech mark$ deepspeech …/models/output_graph.pb p225_001.wav …/models/alphabet.txt …/models/lm.binary …/models/trie
Loading model from file …/models/output_graph.pb
Loaded model in 0.920s.
Loading language model from files …/models/lm.binary …/models/trie
Loaded language model in 3.111s.
Running inference.
huh
Inference took 4.822s for 6.155s audio file.
AMAC02TX3KKHTD8:DeepSpeech mark$ deepspeech …/models/output_graph.pb p225_002.wav …/models/alphabet.txt …/models/lm.binary …/models/trie
Loading model from file …/models/output_graph.pb
Loaded model in 1.026s.
Loading language model from files …/models/lm.binary …/models/trie
Loaded language model in 3.217s.
Running inference.
a cage
Inference took 7.021s for 12.176s audio file.

Any ideas what could cause such behaviour?

BR,
Mark

Are the wav audio files 16-bit, 16 kHz, and mono? If not, deepspeech can’t produce correct transcripts for them.
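If you’re not sure, one quick way to check is sox’s soxi tool (a quick sketch, assuming sox is installed; it prints the channel count, sample rate, and bit depth — file name taken from the post above):

soxi p225_366.wav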


@mark2 I just had a look at your files, and as mentioned, they are 48 kHz instead of the expected 16 kHz; that explains the completely unexpected output.


FTR:

alex@portable-alex:~/tmp/deepspeech/cpu$ LC_ALL=C ./deepspeech ../models/output_graph.pb ../test-data/vctk-p225/wav48/p225/p225_366.wav ../models/alphabet.txt -t 2>&1 
2017-12-22 10:48:34.494758: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
u to o
cpu_time_overall=23.77523 cpu_time_mfcc=0.00953 cpu_time_infer=23.76570
alex@portable-alex:~/tmp/deepspeech/cpu$ LC_ALL=C ./deepspeech ../models/output_graph.pb ../test-data/vctk-p225/wav48/p225/p225_366.16k.wav ../models/alphabet.txt -t 2>&1 
2017-12-22 10:48:54.894628: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
it is lind to the row everyprprose changes at scosish balle
cpu_time_overall=22.01665 cpu_time_mfcc=0.00750 cpu_time_infer=22.00915

And one can do the conversion like this:

alex@portable-alex:~/tmp/deepspeech/cpu$ ffmpeg -i ../test-data/vctk-p225/wav48/p225/p225_366.wav -acodec pcm_s16le -ac 1 -ar 16000 ../test-data/vctk-p225/wav48/p225/p225_366.16k.wav
alex@portable-alex:~/tmp/deepspeech/cpu$ 
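If you have a whole directory to convert, a simple loop over the same ffmpeg command works (a sketch; adjust the paths to your layout):

# convert every 48 kHz wav to a 16 kHz/mono/16-bit copy alongside it
# (note: re-running will also pick up the generated .16k.wav files)
for f in ../test-data/vctk-p225/wav48/p225/*.wav; do
  ffmpeg -i "$f" -acodec pcm_s16le -ac 1 -ar 16000 "${f%.wav}.16k.wav"
done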

Thanks! Now it gives more reasonable answers.

Hello, what are the training data sets that went into the model that is available at https://github.com/mozilla/DeepSpeech/releases?

LibriSpeech[1], Fisher[2,3,4,5], and Switchboard[6]


Hi,
I was wondering about execution time:

Inference took 3.607s for 1.393s audio file.

Is this normal execution time? I have seen some examples online where 30 s was needed for a 28 s audio file.
I also know of a few examples where the usual time is half the duration of the audio.
I haven’t tried GPU-powered deepspeech since my hardware+OS is fighting with Nvidia at the moment.

Thanks,
mirko

Unfortunately, whether this is “normal” is entirely hardware dependent.
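If you want timings you can compare across machines, the native client’s -t flag (shown earlier in this thread) reports separate CPU times for the MFCC and inference stages; audio.16k.wav below is just a placeholder for any 16 kHz mono test file:

./deepspeech ../models/output_graph.pb audio.16k.wav ../models/alphabet.txt -t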

Thanks for the reply :slightly_smiling_face:
So if I prepare a more powerful GPU box, I should expect much better results.
The only reason I ask is that some proprietary software brags about a 1/0.5 ratio of duration to transcription time …

A 1/0.5 ratio should be achievable on a GeForce GTX 1070 or above for clips a few seconds long.

Great, thanks for the info kdavis :slight_smile:
Regards,
mirko

How can I use the pre-trained model?

The README of the current v0.1.1 release describes usage.
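For example, with the release files unpacked into a models directory, the v0.1.1 command-line client is invoked the same way as in the first post of this thread (my_audio.wav is a placeholder for a 16-bit, 16 kHz, mono wav):

deepspeech models/output_graph.pb my_audio.wav models/alphabet.txt models/lm.binary models/trie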

Hi! I got the same error, did you fix it?

Can anybody fix this? :roll_eyes::roll_eyes::roll_eyes::frowning:

Please avoid hijacking threads; properly document your error and your setup, otherwise nobody can help you.


Based on the screenshot posted by @yesterdays, Python 2.7.5 is being used. Use Python 3.5 instead. I am using Python 3.5 and it works fine.

Hi All,
I am trying to train and use a model for English from scratch on version 0.5.1. My aim is to train two models, one with and one without a language model. I would appreciate your help on several fronts. Sorry this is long, but I am trying to be as detailed as possible; also, being new to Linux and data science, I may be stating some very obvious things.
Thank you in advance for your help.
Regards,
Rohit

A) Background:

A1) Using Ubuntu 18.04 LTS, no GPU, 32 GB RAM.

  • Downloaded Mozilla Common Voice Corpus (English) around mid-June 2019.

  • Took the validated.tsv file, did some basic transcript validation, and pruned the dataset to 629731 entries.

  • Selected the first 10k entries, split them 70:20:10 into train:dev:test, and created csv files (a sketch of one way to do the split appears after this list).

  • MP3s converted to wav files (16 kHz, mono, 16-bit), length less than 10 seconds.

  • Set up an Anaconda environment with DeepSpeech v0.5.1.

  • Cloned github v0.5.1 code.

  • Issued this command in the DeepSpeech folder, which seems to be required to fetch the generate_trie executable and other required files:
    python util/taskcluster.py --target .

  • Installed the CTC-decoder from the link obtained from command:
    python util/taskcluster.py --decoder

  • Next, created a vocabulary file containing only the transcripts.

  • No changes to any of the flags or other default parameters.
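For the 70:20:10 split mentioned above, a minimal shell sketch (assuming a pre-shuffled all.csv with a header row and exactly 10000 data rows; all file names here are placeholders):

# copy the header row into all three output files
head -n 1 all.csv | tee train.csv dev.csv test.csv > /dev/null
# data rows 1-7000 -> train, 7001-9000 -> dev, 9001-10000 -> test
tail -n +2 all.csv | head -n 7000 >> train.csv
tail -n +2 all.csv | sed -n '7001,9000p' >> dev.csv
tail -n +2 all.csv | tail -n 1000 >> test.csv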

A2) Language model related:

  • Used KenLM, downloaded from the git repo and compiled. Commands to create the 4-gram version:

  • vocabulary file to arpa:

./lmplz -o 4 --text /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/vocabDir/vocabulary-Set3First10k.txt --arpa /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/vocabDir/vocabulary-Set3First10k_4gram.arpa

  • arpa to lm_binary file:

./build_binary /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/vocabDir/vocabulary-Set3First10k_4gram.arpa /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/lm/lm4gram/vocabulary-Set3First10k_4gram.klm

  • Used generate_trie to make the trie file:

/home/rohit/dpspCODE/v051/DeepSpeech/generate_trie /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/alphabetDir/alphabet-Set3First10k.txt /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/lm/lm4gram/vocabulary-Set3First10k_4gram.klm /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/trie/trie4gram/set3First10k_4gram.trie

  • Note: the trie file was created successfully and later used to start training.
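As a quick sanity check on the lm_binary, KenLM’s query tool (built alongside lmplz and build_binary) scores sentences read from stdin; a sketch, reusing the path from above:

echo "a test sentence" | ./query /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/lm/lm4gram/vocabulary-Set3First10k_4gram.klm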

A3) Commands to start model training (training still in progress):

A3a) Model without language model:

python3 -u DeepSpeech.py \
  --train_files /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/csvFiles/train.csv \
  --dev_files /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/csvFiles/dev.csv \
  --test_files /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/csvFiles/test.csv \
  --train_batch_size 1 \
  --dev_batch_size 1 \
  --test_batch_size 1 \
  --n_hidden 2048 \
  --epoch 20 \
  --dropout_rate 0.15 \
  --learning_rate 0.0001 \
  --export_dir /home/rohit/dpspTraining/models/v051/model5-validFirst10k-noLM/savedModel \
  --checkpoint_dir /home/rohit/dpspTraining/models/v051/model5-validFirst10k-noLM/checkpointDir \
  --alphabet_config_path /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/alphabetDir/alphabet-Set3First10k.txt \
  "$@"

A3b) Model with language model:

python3 -u DeepSpeech.py \
  --train_files /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/csvFiles/train.csv \
  --dev_files /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/csvFiles/dev.csv \
  --test_files /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/csvFiles/test.csv \
  --train_batch_size 1 \
  --dev_batch_size 1 \
  --test_batch_size 1 \
  --n_hidden 2048 \
  --epoch 20 \
  --dropout_rate 0.15 \
  --learning_rate 0.0001 \
  --export_dir /home/rohit/dpspTraining/models/v051/model6-validFirst10k-yesLM-4gram/savedModel \
  --checkpoint_dir /home/rohit/dpspTraining/models/v051/model6-validFirst10k-yesLM-4gram/checkpointDir \
  --decoder_library_path /home/rohit/dpspCODE/v051/DeepSpeech/native_client/libctc_decoder_with_kenlm.so \
  --alphabet_config_path /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/alphabetDir/alphabet-Set3First10k.txt \
  --lm_binary_path /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/lm/lm4gram/vocabulary-Set3First10k_4gram.klm \
  --lm_trie_path /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/trie/trie4gram/set3First10k_4gram.trie \
  "$@"

B) My questions:

B1) When using a language model either for training or inference, do I HAVE to specify the lm_binary parameter AND the corresponding trie file? Can using only the trie work?

B2) Irrespective of whether a language model (lm_binary and trie together) was used while training the model, can I later choose to use or not use a language model at inference time? Can a different language model be used, or only the one used for training? Are there things to note when choosing an alternative model, e.g. training with a 3-gram model but using a 4-gram model during inference?

B3) Suppose my model is already built by training with a vocabulary file, arpa, trie and lm_binary made from only 10k data points. Say I create a new vocabulary called BigVocabulary from a larger corpus than the one used for training.

E.g. the entire 629731 data points in the validated.tsv file; I use the bigger vocabulary to create the .arpa, lm_binary and trie files, and I ensure the valid characters are exactly the same by comparing the alphabet files. Then, on the model trained with the smaller vocabulary, can I use the BigVocabulary binary and trie files while doing inference, using the command below?

I already created a model with only the first 1000 files; inference is poor, but it works.
Command:

deepspeech \
  --model /home/rohit/dpspTraining/models/v051/model8-validFirst1k-yesLM-4gram/savedModel/output_graph.pb \
  --alphabet /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/alphabetDir/alphabet-Set5First1050.txt \
  --lm /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/lm/lm4gram/vocabulary-Set5First1050_4gram.klm \
  --trie /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/trie/trie4gram/Set5First1050_4gram.trie \
  --audio /home/rohit/dpspTraining/data/wavFiles/wav33/test/File28.wav

Console output:

(dpsp5v051basic) rohit@DE-W-0246802:~/dpspCODE/v051/DeepSpeech$ deepspeech \
--model /home/rohit/dpspTraining/models/v051/model8-validFirst1k-yesLM-4gram/savedModel/output_graph.pb \
--alphabet /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/alphabetDir/alphabet-Set5First1050.txt \
--lm /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/lm/lm4gram/vocabulary-Set5First1050_4gram.klm \
--trie /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/trie/trie4gram/Set5First1050_4gram.trie \
--audio /home/rohit/dpspTraining/data/wavFiles/wav33/test/File28.wav
Loading model from file /home/rohit/dpspTraining/models/v051/model8-validFirst1k-yesLM-4gram/savedModel/output_graph.pb
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.1-0-g4b29b78
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2019-08-01 16:11:02.155443: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-08-01 16:11:02.179690: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "CPU"') for unknown op: UnwrapDatasetVariant
2019-08-01 16:11:02.179740: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: WrapDatasetVariant
2019-08-01 16:11:02.179756: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "CPU"') for unknown op: WrapDatasetVariant
2019-08-01 16:11:02.179891: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: UnwrapDatasetVariant
Loaded model in 0.0283s.
Loading language model from files /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/lm/lm4gram/vocabulary-Set5First1050_4gram.klm /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/trie/trie4gram/Set5First1050_4gram.trie
Loaded language model in 0.068s.
Running inference.
a on a in a is the
Inference took 0.449s for 3.041s audio file.

But if I use the BigVocabulary trie and lm_binary files, then I get an error saying “Error: Trie file version mismatch (4 instead of expected 3). Update your trie file.” Yet it still seems to load the language model. So did DeepSpeech actually pick it up and apply it correctly? How do I fix this error?

Command:

deepspeech \
  --model /home/rohit/dpspTraining/models/v051/model8-validFirst1k-yesLM-4gram/savedModel/output_graph.pb \
  --alphabet /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/alphabetDir/alphabet-Set5First1050.txt \
  --lm /home/rohit/dpspTraining/data/wavFiles/testVocabAllValidated/lm/lm4gram/vocabulary-allValidated_o4gram.klm \
  --trie /home/rohit/dpspTraining/data/wavFiles/testVocabAllValidated/trie/trie4gram/allValidated_o4gram.trie \
  --audio /home/rohit/dpspTraining/data/wavFiles/wav33/test/File28.wav

Console output:

(dpsp5v051basic) rohit@DE-W-0246802:~/dpspCODE/v051/DeepSpeech$ deepspeech \
--model /home/rohit/dpspTraining/models/v051/model8-validFirst1k-yesLM-4gram/savedModel/output_graph.pb \
--alphabet /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/alphabetDir/alphabet-Set5First1050.txt \
--lm /home/rohit/dpspTraining/data/wavFiles/testVocabAllValidated/lm/lm4gram/vocabulary-allValidated_o4gram.klm \
--trie /home/rohit/dpspTraining/data/wavFiles/testVocabAllValidated/trie/trie4gram/allValidated_o4gram.trie \
--audio /home/rohit/dpspTraining/data/wavFiles/wav33/test/File28.wav
Loading model from file /home/rohit/dpspTraining/models/v051/model8-validFirst1k-yesLM-4gram/savedModel/output_graph.pb
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.1-0-g4b29b78
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2019-08-01 16:11:58.305524: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-08-01 16:11:58.322902: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "CPU"') for unknown op: UnwrapDatasetVariant
2019-08-01 16:11:58.322945: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: WrapDatasetVariant
2019-08-01 16:11:58.322956: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "CPU"') for unknown op: WrapDatasetVariant
2019-08-01 16:11:58.323063: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: UnwrapDatasetVariant
Loaded model in 0.0199s.
Loading language model from files /home/rohit/dpspTraining/data/wavFiles/testVocabAllValidated/lm/lm4gram/vocabulary-allValidated_o4gram.klm /home/rohit/dpspTraining/data/wavFiles/testVocabAllValidated/trie/trie4gram/allValidated_o4gram.trie
Error: Trie file version mismatch (4 instead of expected 3). Update your trie file.
Loaded language model in 0.00368s.
Running inference.
an on o tn o as te tee
Inference took 1.893s for 3.041s audio file.

Thank you for your time.

I think this should be in a new thread.

However, to the best of my knowledge, the language model doesn’t affect training or validation; it is only used in the decoding step when testing the trained model. So there is no point in training two separate models.
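In practice that means the language-model choice is made at inference time: if I remember correctly, --lm and --trie are optional in the v0.5.1 client, so you can run the same trained model both ways (a sketch, reusing the paths from your post):

# with the language model
deepspeech \
  --model /home/rohit/dpspTraining/models/v051/model8-validFirst1k-yesLM-4gram/savedModel/output_graph.pb \
  --alphabet /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/alphabetDir/alphabet-Set5First1050.txt \
  --lm /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/lm/lm4gram/vocabulary-Set5First1050_4gram.klm \
  --trie /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/trie/trie4gram/Set5First1050_4gram.trie \
  --audio /home/rohit/dpspTraining/data/wavFiles/wav33/test/File28.wav

# without any language model: simply omit --lm and --trie
deepspeech \
  --model /home/rohit/dpspTraining/models/v051/model8-validFirst1k-yesLM-4gram/savedModel/output_graph.pb \
  --alphabet /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/alphabetDir/alphabet-Set5First1050.txt \
  --audio /home/rohit/dpspTraining/data/wavFiles/wav33/test/File28.wav

As for the “Trie file version mismatch” error: the trie file format has to match the client version, so the usual fix is to regenerate the BigVocabulary trie with the generate_trie binary that matches your installed v0.5.1 client (the one you used earlier in this post).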