Getting logits as output

Is there a way to get the logits (the output of the softmax layer) during inference from the exported graph, output_graph.pb? I want to use my own CTC decoder and am therefore looking for a way to predict just the logits from the graph.

Is there a way to achieve this through the Python deepspeech API? If not, can you please point me to some other way of achieving it?
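For context, the kind of decoder I want to plug in would start from something like greedy (best-path) CTC decoding over the per-frame logits. This is just a sketch of my own, assuming the convention that the blank label is the last class, as DeepSpeech does:

```python
import numpy as np

def ctc_greedy_decode(logits, alphabet, blank=None):
    """Best-path decoding: argmax each frame, collapse repeats, drop blanks.

    logits: array of shape (time, num_classes); alphabet: list of characters.
    """
    if blank is None:
        blank = logits.shape[1] - 1  # assume the CTC blank label is last
    best_path = np.argmax(logits, axis=1)
    decoded, prev = [], None
    for label in best_path:
        if label != prev and label != blank:
            decoded.append(alphabet[label])
        prev = label
    return ''.join(decoded)
```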

no

just access logits tensor?

Yeah, I did that and found the logits tensor. Can you also tell me how to feed in the input data? I found that the node named input_samples takes an input of shape (512,). Does that correspond to raw audio data?

Are we talking about the inference code or the training code ?

Inference. I am trying to load output_graph.pb and use it to extract logits from an audio file. What I am not able to figure out is which input node in the graph accepts the audio as input.

Here is my code so far:

import tensorflow as tf
from tensorflow.python.platform import gfile
from scipy.io import wavfile

# Read the wav file
fs, wav = wavfile.read('file.wav')

# Load the frozen graph
sess = tf.Session()
with gfile.FastGFile('output_graph.pb', 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
    del graph_def.node[-1]  # drop the last node of the graph
    sess.graph.as_default()
    tf.import_graph_def(graph_def, name='')
sess.run(tf.global_variables_initializer())

# Logits tensor
logits = sess.graph.get_tensor_by_name('logits:0')

# Predict logits for the audio file
res = sess.run(logits, {'input_samples:0': wav})

The above snippet gives the following error:

ValueError: Cannot feed value of shape (112000,) for Tensor 'input_samples:0', which has shape '(512,)'

So what I wanted to know is: which tensor do we need to feed the raw audio data to?

It is input_node: https://github.com/mozilla/DeepSpeech/blob/457198c88d7ad96ee4596cb21deaeca77c277898/native_client/tfmodelstate.cc#L222

Thanks. I ran the following:
sess.run(logits, {'input_node:0': features, 'input_lengths:0': features_len})

Got the following error:

FailedPreconditionError: 2 root error(s) found.
  (0) Failed precondition: Attempting to use uninitialized value previous_state_h
	 [[node previous_state_h/read (defined at /home/shantanu/.local/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
	 [[GroupCrossDeviceControlEdges_0/logits/_30]]
  (1) Failed precondition: Attempting to use uninitialized value previous_state_h
	 [[node previous_state_h/read (defined at /home/shantanu/.local/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Do we have to initialize the LSTM states manually?

Please read the code I linked; it explicitly answers your question.

Thanks. I read it and tested the code by initializing previous_state_h to random values.

sess.run(logits, {
    'input_node:0': features,
    'input_lengths:0': features_len,
    'previous_state_h:0': np.random.randn(1, 2048).astype(np.float32)
})

I got the following error:

InvalidArgumentError: Input 0 of node Assign_3 was passed float from _arg_previous_state_h_0_2:0 incompatible with expected float_ref.

Upon checking, I saw that the tensor previous_state_h was of type float_ref
<tf.Tensor 'previous_state_h:0' shape=(1, 2048) dtype=float32_ref>

But according to the code shared, the type defined there is plain float. Can you please give me some pointers to resolve this?

P.S. I am using deepspeech v0.5.1

Nice of you to say so; the model is not the same …

Sorry, I should have specified that earlier.

But even in the native_client of v0.5.1 (filename: deepspeech.cc), the node is defined as
std::unique_ptr<float[]> previous_state_h_;

It’s float, but when feeding it a float array, TensorFlow requires float_ref. How do I feed it as float_ref? Most of the answers I found online say that converting the tf.Variable to a tf.placeholder will work, but that won’t work here since I am doing inference on a frozen graph. Please suggest some other way.

Please look at the python code that runs the training …

The states had to be initialized first by running the initialize_state operation. Now it is working. Thanks a lot.
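For anyone who lands here later, the call order that ended up working is roughly the following. This is only a sketch: it assumes the session, features, and node names from the snippets above, plus a v0.5.1 output_graph.pb, so it is not runnable on its own:

```python
# Run the graph's dedicated init op once before the first inference;
# it resets the LSTM state variables (previous_state_c / previous_state_h),
# which is what the C++ client does as well.
sess.run('initialize_state')

# With the states initialized, logits can be fetched as usual.
res = sess.run('logits:0', {
    'input_node:0': features,          # shape (1, 16, 19, 26)
    'input_lengths:0': features_len,
})
```

Feeding the state tensors directly is not needed; letting the init op assign them avoids the float vs. float_ref mismatch entirely.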


Hi, after loading the exported graph, output_graph.pb, I saw that the input node is fixed to accept only 16 timesteps of features
Tensor("input_node:0", shape=(1, 16, 19, 26), dtype=float32)

Is there any particular reason as to why only 16 timesteps?

Yes, @reuben can elaborate when he is back from holidays, but basically, as much as I recall of the design, it was a good balance between complexity (the longer the window, the higher) and accuracy (the longer the window, the better).
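To make the fixed shape (1, 16, 19, 26) concrete: batch of 1, 16 timesteps per run, and per timestep a context window of 19 frames (9 on each side of the current frame) of 26 MFCC features. A numpy sketch of how features might be windowed and chunked to fit that shape — helper names are mine, and the exact v0.5.1 feature pipeline may differ in details:

```python
import numpy as np

N_CONTEXT = 9   # context frames on each side: 2 * 9 + 1 = 19
N_INPUT = 26    # MFCC coefficients per frame
N_STEPS = 16    # timesteps per run, from input_node's fixed shape

def make_windows(mfcc):
    """Turn a (T, 26) MFCC matrix into (T, 19, 26) context windows,
    zero-padding at the edges of the utterance."""
    padded = np.pad(mfcc, ((N_CONTEXT, N_CONTEXT), (0, 0)), mode='constant')
    return np.stack([padded[i:i + 2 * N_CONTEXT + 1]
                     for i in range(len(mfcc))])

def batches(windows):
    """Yield (1, 16, 19, 26) chunks, zero-padding the final partial chunk."""
    for start in range(0, len(windows), N_STEPS):
        batch = windows[start:start + N_STEPS]
        if len(batch) < N_STEPS:
            pad = np.zeros((N_STEPS - len(batch),) + batch.shape[1:],
                           batch.dtype)
            batch = np.concatenate([batch, pad])
        yield batch[np.newaxis]  # prepend the batch dimension
```

Each yielded chunk would then be fed to input_node:0 in turn, with the graph's state variables carrying the LSTM state across chunks.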
