Getting logits as output

Is there a way to get the logits (the output of the softmax layer) during inference from the exported graph, output_graph.pb? I want to use my own CTC decoder and am therefore looking for a way to predict just the logits from the graph.

Is there a way to achieve this through the Python deepspeech API? If not, can you please point me to some other way of achieving it?
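For context, the kind of decoder I want to plug in would start from something like greedy (best-path) CTC decoding over the per-frame logits. This is just a sketch of my own, assuming the convention that the blank label is the last class, as DeepSpeech does:

```python
import numpy as np

def ctc_greedy_decode(logits, alphabet, blank=None):
    """Best-path decoding: argmax each frame, collapse repeats, drop blanks.

    logits: array of shape (time, num_classes); alphabet: list of characters.
    """
    if blank is None:
        blank = logits.shape[1] - 1  # assume the CTC blank label is last
    best_path = np.argmax(logits, axis=1)
    decoded, prev = [], None
    for label in best_path:
        if label != prev and label != blank:
            decoded.append(alphabet[label])
        prev = label
    return ''.join(decoded)
```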

no

just access logits tensor?

Yeah, I did that and found the logits tensor. Can you also tell me how to feed in the input data? I found that the node named input_samples takes an input of shape (512,). Does that correspond to raw audio data?

Are we talking about the inference code or the training code ?

Inference. I am trying to load output_graph.pb and use it to extract logits from an audio file. What I am not able to figure out is which input node in the graph accepts the audio as input.

Here is my code so far:

import tensorflow as tf
from tensorflow.python.platform import gfile
from scipy.io import wavfile

# Read the wav file
fs, wav = wavfile.read('file.wav')

# Load the frozen graph
sess = tf.Session()
with gfile.FastGFile('output_graph.pb', 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
    del graph_def.node[-1]  # drop the last node of the graph
    sess.graph.as_default()
    tf.import_graph_def(graph_def, name='')
sess.run(tf.global_variables_initializer())

# Logits tensor
logits = sess.graph.get_tensor_by_name('logits:0')

# Predict logits for the audio file
res = sess.run(logits, {'input_samples:0': wav})

The above snippet gives the following error:

ValueError: Cannot feed value of shape (112000,) for Tensor 'input_samples:0', which has shape '(512,)'

So what I wanted to know is: which tensor do we need to feed the raw audio data to?

It is input_node: https://github.com/mozilla/DeepSpeech/blob/457198c88d7ad96ee4596cb21deaeca77c277898/native_client/tfmodelstate.cc#L222

Thanks. I ran the following:
sess.run(logits, {'input_node:0': features, 'input_lengths:0': features_len})

Got the following error:

FailedPreconditionError: 2 root error(s) found.
  (0) Failed precondition: Attempting to use uninitialized value previous_state_h
	 [[node previous_state_h/read (defined at /home/shantanu/.local/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
	 [[GroupCrossDeviceControlEdges_0/logits/_30]]
  (1) Failed precondition: Attempting to use uninitialized value previous_state_h
	 [[node previous_state_h/read (defined at /home/shantanu/.local/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]

Do we have to initialize the LSTM states manually?

Please read the code I linked; it explicitly answers your question.

Thanks. I read it and tested the code by initializing previous_state_h to random values.

sess.run(logits, {
    'input_node:0': features,
    'input_lengths:0': features_len,
    'previous_state_h:0': np.random.randn(1, 2048).astype(np.float32)
})

I got the following error:

InvalidArgumentError: Input 0 of node Assign_3 was passed float from _arg_previous_state_h_0_2:0 incompatible with expected float_ref.

Upon checking, I saw that the tensor previous_state_h was of type float_ref
<tf.Tensor 'previous_state_h:0' shape=(1, 2048) dtype=float32_ref>

But according to the code shared, the type defined there is plain float. Can you please give me some pointers to resolve this?

P.S. I am using deepspeech v0.5.1

Nice of you to say so; the model is not the same …

Sorry, I should have specified that earlier.

But even in the native_client of v0.5.1 (filename: deepspeech.cc), the node is defined as
std::unique_ptr<float[]> previous_state_h_;

It’s float, but when feeding it a float array, TensorFlow requires float_ref. How do I feed it as float_ref? Most of the answers I found online say that converting the tf.Variable to a tf.placeholder will work, but that won’t work here since I am doing inference on a frozen graph. Please suggest some other way.

Please look at the python code that runs the training …

The states had to be initialized first by running the initialize_state operation. Now it is working. Thanks a lot.
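For anyone who lands here later, the call order that ended up working is roughly the following. This is only a sketch: it assumes the session, features, and node names from the snippets above, plus a v0.5.1 output_graph.pb, so it is not runnable on its own:

```python
# Run the graph's dedicated init op once before the first inference;
# it resets the LSTM state variables (previous_state_c / previous_state_h),
# which is what the C++ client does as well.
sess.run('initialize_state')

# With the states initialized, logits can be fetched as usual.
res = sess.run('logits:0', {
    'input_node:0': features,          # shape (1, 16, 19, 26)
    'input_lengths:0': features_len,
})
```

Feeding the state tensors directly is not needed; letting the init op assign them avoids the float vs. float_ref mismatch entirely.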


Hi, after loading the exported graph, output_graph.pb, I saw that the input node is fixed to accept only 16 timesteps of features
Tensor("input_node:0", shape=(1, 16, 19, 26), dtype=float32)

Is there any particular reason as to why only 16 timesteps?

Yes, @reuben can elaborate when he is back from holidays, but basically, as much as I recall of the design, it was a good balance between complexity (the longer the window, the higher) and accuracy (the longer the window, the better).
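To make the fixed shape (1, 16, 19, 26) concrete: batch of 1, 16 timesteps per run, and per timestep a context window of 19 frames (9 on each side of the current frame) of 26 MFCC features. A numpy sketch of how features might be windowed and chunked to fit that shape — helper names are mine, and the exact v0.5.1 feature pipeline may differ in details:

```python
import numpy as np

N_CONTEXT = 9   # context frames on each side: 2 * 9 + 1 = 19
N_INPUT = 26    # MFCC coefficients per frame
N_STEPS = 16    # timesteps per run, from input_node's fixed shape

def make_windows(mfcc):
    """Turn a (T, 26) MFCC matrix into (T, 19, 26) context windows,
    zero-padding at the edges of the utterance."""
    padded = np.pad(mfcc, ((N_CONTEXT, N_CONTEXT), (0, 0)), mode='constant')
    return np.stack([padded[i:i + 2 * N_CONTEXT + 1]
                     for i in range(len(mfcc))])

def batches(windows):
    """Yield (1, 16, 19, 26) chunks, zero-padding the final partial chunk."""
    for start in range(0, len(windows), N_STEPS):
        batch = windows[start:start + N_STEPS]
        if len(batch) < N_STEPS:
            pad = np.zeros((N_STEPS - len(batch),) + batch.shape[1:],
                           batch.dtype)
            batch = np.concatenate([batch, pad])
        yield batch[np.newaxis]  # prepend the batch dimension
```

Each yielded chunk would then be fed to input_node:0 in turn, with the graph's state variables carrying the LSTM state across chunks.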
