Can I use a pre-trained model with DeepSpeech.py?


(Kim) #1

Hi,

Please help me figure out how to use the pre-trained model with DeepSpeech.py.

I tried to pass the files from pre-trained models to DeepSpeech.py’s corresponding arguments:

--decoder_library_path
--alphabet_config_path
--lm_binary_path
--lm_trie_path

But I got an error saying:

tensorflow.python.framework.errors_impl.NotFoundError: deepspeech-0.1.0-models/models/output_graph.pb: invalid ELF header

I’m not sure how to fix this.

The pre-trained models worked fine with the deepspeech command line tool.

My goal is to get the output matrix instead of the text output. Is there a workaround for this?

Thank you.


(kdavis) #2

To clarify, what do you mean by “output matrix”?


(Reuben Morais) #3

You can’t use the released model with DeepSpeech.py unless you make significant changes to it: there’s no command line parameter that expects a frozen model to be passed in. From the error, it looks like you may have passed --decoder_library_path models/models/output_graph.pb, when you should be passing the path to libctc_decoder_with_kenlm.so instead. The “invalid ELF header” message means TensorFlow tried to load that file as a shared library.


(Yv) #4

My guess would be that Kim is looking for the probabilities of each character rather than the most likely character (before the language model is applied), something like this https://cdn-images-1.medium.com/max/1000/1*d1ktMdOnFOJRKKyjFP6sqQ.png.

If that’s the case, I’d be interested in it too (e.g. as a param of the commandline deepspeech).


(Kim) #5

Sorry about the confusion. Like yv001 said, by output matrix I meant the probabilities of each character.


(kdavis) #6

This would require writing a custom C++ operator that modifies our current custom C++ operator for CTC decoding[1] and would be a significant amount of work.


(Reuben Morais) #7

You don’t necessarily have to use DeepSpeech.py to access the logits. You could also write some code that loads the release model using the TensorFlow API and then fetches the “logits” node. You could try modifying this code to load from the frozen graph instead of restoring a checkpoint: https://github.com/mozilla/DeepSpeech/blob/99d0c311a3e6108ed835fa38e4680ba7b3744fad/DeepSpeech.py#L1748-L1775

The relevant TensorFlow API is tf.import_graph_def.
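
For anyone following along, here is a rough sketch of what that could look like. It is not code from the DeepSpeech repository: the function names are made up for illustration, it uses the TensorFlow 1.x API (under TensorFlow 2 the same calls live under tf.compat.v1), and it assumes the node is called “logits” as mentioned above. The input tensor names of the release graph are not given in this thread, so the caller has to supply them in the feed dictionary.

```python
def tensor_name(node_name, output_index=0):
    """Turn a node name such as 'logits' into a tensor name such as 'logits:0'."""
    if ":" in node_name:
        return node_name
    return "{}:{}".format(node_name, output_index)


def load_frozen_graph(pb_path):
    """Parse a frozen GraphDef from disk and import it into a fresh graph."""
    # Imported here so tensor_name() can be used without TensorFlow installed.
    import tensorflow as tf

    with tf.gfile.GFile(pb_path, "rb") as handle:
        graph_def = tf.GraphDef()
        graph_def.ParseFromString(handle.read())

    graph = tf.Graph()
    with graph.as_default():
        # name="" keeps the original node names instead of adding an "import/" prefix.
        tf.import_graph_def(graph_def, name="")
    return graph


def fetch_logits(graph, inputs, node_name="logits"):
    """Run the graph and return the raw values of the given node.

    `inputs` maps input node or tensor names to arrays. Which names the
    release graph expects is an assumption the caller has to check.
    """
    import tensorflow as tf

    logits = graph.get_tensor_by_name(tensor_name(node_name))
    feed = {
        graph.get_tensor_by_name(tensor_name(name)): value
        for name, value in inputs.items()
    }
    with tf.Session(graph=graph) as session:
        return session.run(logits, feed_dict=feed)
```

If you are unsure which nodes the graph contains, listing `[op.name for op in graph.get_operations()]` after loading is a quick way to check before calling `fetch_logits`.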


(Yv) #8

A debugging script that shows most likely characters and their probabilities can be found here:
https://github.com/pvanickova/DeepSpeech/blob/master/bin/show_inferred_characters.py

It loads a frozen graph, runs inference on a given WAV file with the given model and alphabet, and applies softmax to the logits to get the probabilities. It then displays the most likely characters and their probabilities, with “-” standing for blank. Each character prediction is on a new line, so the predicted text can be read down the columns.

E.g. the string “cat” could have these character predictions:
c k - (0.999957) (1.62438e-05) (6.99057e-06)
a - (0.999998) (1.05044e-06) (4.43978e-07)
t d - (0.999999) (4.27885e-07) (1.09088e-07)
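
The softmax-and-rank step for a single time step can be sketched roughly as follows. This is an illustration in plain Python, not the code of the linked script: the function names are hypothetical, and it assumes the blank label is the last class after the alphabet characters.

```python
import math


def softmax(logits):
    """Convert one time step of logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def top_characters(logits, alphabet, count=3, blank="-"):
    """Return the `count` most likely (character, probability) pairs."""
    # Assumption: the blank label is the last class, after the alphabet.
    labels = list(alphabet) + [blank]
    if len(logits) != len(labels):
        raise ValueError("expected one logit per alphabet character plus blank")
    ranked = sorted(zip(labels, softmax(logits)), key=lambda pair: pair[1], reverse=True)
    return ranked[:count]


def format_row(ranked):
    """Format one time step as characters followed by their probabilities."""
    chars = " ".join(char for char, _ in ranked)
    probs = " ".join("({:.6g})".format(prob) for _, prob in ranked)
    return chars + " " + probs
```

Calling `format_row(top_characters(step_logits, alphabet))` for each time step gives one line per prediction, in the same layout as the example above.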

This is how the script is run:

python3 ./bin/show_inferred_characters.py --input-file "../data/my.wav" --model-file ../data/models/output_graph.pb --alphabet-file ../data/models/alphabet.txt --predicted-character-count 3

If it looks useful enough for others, give me a shout and I’ll create a pull request.

