Is there a way to get the logits (the output of the softmax layer) during inference from the exported graph, output_graph.pb? I want to use my own CTC decoder and am thus looking for a way to predict just the logits from the graph.
Is there a way to achieve this through the Python deepspeech API? If not, can you please point me to some other way I can achieve it?
Yeah, I did that. Found the logits tensor. Can you also please tell me how to send the input data? I found that the node named input_samples takes an input of size (512,). Does that correspond to raw audio data?
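For reference, a minimal sketch of how the node names can be listed from the frozen graph, just by iterating over the GraphDef (nothing DeepSpeech-specific here):

import tensorflow as tf
from tensorflow.python.platform import gfile

# Parse the frozen graph
graph_def = tf.GraphDef()
with gfile.FastGFile('output_graph.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

# Print every node name and op type; Placeholder nodes are the graph's inputs
for node in graph_def.node:
    print(node.name, node.op)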
lissyx:
Are we talking about the inference code or the training code?
Inference. I am trying to load output_graph.pb and use it to extract logits from an audio file. What I am not able to understand is which node in the graph accepts the audio data as input.
Here is my code so far:
import tensorflow as tf
from tensorflow.python.platform import gfile
from scipy.io import wavfile

# Read wav file
fs, wav = wavfile.read('file.wav')

# Loading graph
sess = tf.Session()
with gfile.FastGFile('output_graph.pb', 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
    del(graph_def.node[-1])
    sess.graph.as_default()
    tf.import_graph_def(graph_def, name='')
sess.run(tf.global_variables_initializer())

# Logits Tensor
logits = sess.graph.get_tensor_by_name('logits:0')

# Predict logits of an audio file
res = sess.run(logits, {'input_samples:0': wav})
The above snippet gives the following error:
ValueError: Cannot feed value of shape (112000,) for Tensor 'input_samples:0', which has shape '(512,)'
So what I wanted to know is: which tensor do we need to feed the raw audio data to?
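For reference, a minimal sketch of how the features and features_len fed to input_node / input_lengths might be computed. The exact settings below (16 kHz audio, 26 MFCCs, 32 ms window, 20 ms step, 9 frames of context on each side, giving windows of 19 x 26) are assumptions taken from the 0.5.x training defaults, so double-check them against the training code:

import numpy as np
from python_speech_features import mfcc
from scipy.io import wavfile

fs, audio = wavfile.read('file.wav')

# Per-frame MFCC features (assumed settings: 26 cepstra, 32 ms window, 20 ms step)
feat = mfcc(audio, samplerate=fs, winlen=0.032, winstep=0.02, numcep=26)

# Build overlapping context windows of 2*9 + 1 = 19 frames around each step
numcontext = 9
feat = np.pad(feat, ((numcontext, numcontext), (0, 0)), mode='constant')
features = np.array([feat[i - numcontext:i + numcontext + 1]
                     for i in range(numcontext, len(feat) - numcontext)])

features_len = np.array([len(features)])  # value for input_lengths; features is (num_steps, 19, 26)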
Thanks. I ran the following command: sess.run(logits, {'input_node:0': features, 'input_lengths:0': features_len})
and got the following error:
FailedPreconditionError: 2 root error(s) found.
(0) Failed precondition: Attempting to use uninitialized value previous_state_h
[[node previous_state_h/read (defined at /home/shantanu/.local/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[GroupCrossDeviceControlEdges_0/logits/_30]]
(1) Failed precondition: Attempting to use uninitialized value previous_state_h
[[node previous_state_h/read (defined at /home/shantanu/.local/lib/python3.6/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Do we have to initialize the LSTM states manually?
lissyx:
Please read the code I linked, it explicitly answers your question.
But even in the native_client of v0.5.1 (filename: deepspeech.cc), the node is defined as std::unique_ptr<float[]> previous_state_h_;
It's a float array, but when I feed it one, TensorFlow says it requires float_ref. How do I feed it as float_ref? Most of the answers I found online say that converting the tf.Variable to a tf.placeholder will work, but that won't work here since I am doing inference on a frozen graph. Please suggest some other way.
lissyx:
Please look at the python code that runs the training …
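For reference, a minimal sketch of what that code boils down to. The float_ref type just reflects that previous_state_c / previous_state_h are graph variables: they are not meant to be fed through the feed_dict, they are reset by running an op inside the graph. This assumes the frozen 0.5.x graph exports that op under the name initialize_state, which is what the native client runs before streaming audio in:

# Reset the in-graph LSTM state before inference (assumption: the reset op is
# exported as 'initialize_state' and assigns zeros to previous_state_c/h)
sess.run('initialize_state')

After running it, the sess.run(logits, ...) call above should no longer raise the FailedPreconditionError about uninitialized previous_state_h.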
Hi, after loading the exported graph output_graph.pb, I saw that the input node is fixed to accept only 16 timesteps of audio: Tensor("input_node:0", shape=(1, 16, 19, 26), dtype=float32)
Is there any particular reason why it's only 16 timesteps?
lissyx:
Yes, @reuben can elaborate when he is back from holidays, but basically, as far as I recall of the design, it was a good balance between complexity (the longer the time, the higher) and accuracy (the longer the time, the better).
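In practice that means a longer utterance gets fed in successive 16-step windows, with the in-graph state carrying over between runs. A rough sketch, reusing the sess and features from earlier and the same assumed node names (so treat it as an outline, not the exact native-client logic):

import numpy as np

sess.run('initialize_state')               # reset previous_state_c / previous_state_h
outputs = []
for start in range(0, len(features), 16):
    window = features[start:start + 16]
    n = len(window)
    if n < 16:                             # pad the last chunk to the fixed window size
        window = np.pad(window, ((0, 16 - n), (0, 0), (0, 0)), mode='constant')
    out = sess.run('logits:0', {'input_node:0': window[np.newaxis],
                                'input_lengths:0': [n]})
    outputs.append(out)

# Depending on the logits layout, the frames produced for the padded tail may need
# to be trimmed before running a custom CTC decoder on the concatenated result.
logits_all = np.concatenate(outputs)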