Inference prediction with own trained model

david0861 · September 7, 2018, 9:46am

OS Platform and Distribution: Ubuntu 16.04
Python version: 3.5.2
CUDA/cuDNN version: 9.0/7.1
GPU model and memory: 2 x gtx 1080 ti

I have trained my own Spanish model with my own data (8 kHz) and I would like to make predictions with it, but I can’t get it to work. (I modified the client.py script to accept 8 kHz audio.)
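Since the sample rate is central here, it can help to confirm what each WAV file actually contains before feeding it to the client. The sketch below uses only Python's standard `wave` module. The function names are hypothetical, and the mono, 16-bit defaults are assumptions; the post only states that the data is 8 kHz.

```python
import wave

def describe_wav(path):
    """Return (sample_rate_hz, channels, bytes_per_sample, duration_s) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        width = wav.getsampwidth()
        duration = wav.getnframes() / float(rate)
    return rate, channels, width, duration

def matches_training_format(path, expected_rate=8000, expected_channels=1, expected_width=2):
    """True if the file has the expected rate, channel count and sample width.

    The defaults (8 kHz, mono, 16-bit) are assumptions, not values confirmed in this thread.
    """
    rate, channels, width, _ = describe_wav(path)
    return (rate, channels, width) == (expected_rate, expected_channels, expected_width)

# Example call (path taken from the command in this post):
# print(describe_wav("data/audio_1.wav"))
```

Running the same check over the training CSV files and the inference audio would show whether both sides really use the same format.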

Experiment 1:

Deepspeech: 0.2.0a8
TensorFlow version: tensorflow-warpctc 1.6.0 (build from source)
Model trained with deepspeech 0.2.0a8

$ python native_client/python/client.py --model models/output_graph_0a8.pb \
--alphabet models/alphabet.txt --lm models/5gram.klm --trie models/trie --audio data/audio_1.wav

Loading model from file models/output_graph_0a8.pb
TensorFlow: v1.6.0-16-gc346f2c
DeepSpeech: v0.2.0-alpha.8-0-gcd47560
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2018-09-07 08:48:42.331571: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Not found: Op type not registered 'BlockLSTM' in binary running on david-desktop. Make sure the Op and Kernel are registered in the binary running in this process.
Loaded model in 0.118s.
Loading language model from files models/5gram.klm models/trie
Loaded language model in 0.0787s.
Running inference.
Segmentation fault (core dumped)

Experiment 2:

Deepspeech: 0.2.0a9
TensorFlow version: tensorflow-warpctc 1.6.0 (build from source)
Model trained with deepspeech 0.2.0a9. In this case it seems to produce a prediction, but the prediction is wrong, even though this audio file is part of the training dataset.

$ python native_client/python/client.py --model models/output_graph_0a9.pb \
--alphabet models/alphabet.txt --lm models/5gram.klm --trie models/trie --audio data/audio_0.wav

Loading model from file models/output_graph_0a9.pb
TensorFlow: v1.6.0-18-g5021473
DeepSpeech: v0.2.0-alpha.9-0-gd59cdc3
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2018-09-07 09:42:49.747845: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 0.085s.
Loading language model from files models/5gram.klm models/trie
Loaded language model in 0.192s.
Running inference.
te te v pal
Inference took 3.998s for 10.450s audio file.

I’ve trained the model with 3 different datasets and got a reasonable WER, so I don’t understand how I get such a bad prediction on training data.

Some users said that the reason is the hyperparameters. But if I have a model with a low WER, why is this model not able to generate a good transcription of the training data?
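To put a number on how bad a single transcription is, the client output can be scored against the known transcript using the usual definition of WER: word-level edit distance divided by the number of reference words. The following is a minimal sketch of that definition, not the scoring code used by the training script; the function names and the example sentences are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (here: lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / float(len(ref_words))

# Hypothetical sentences, not taken from the dataset in this thread:
print(word_error_rate("buenos dias a todos", "buenos dia todos"))
```

Scoring a handful of training utterances this way gives figures that can be compared directly with the WER printed at the end of training.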

$ CUDA_VISIBLE_DEVICES=0,1 ./DeepSpeech.py \
--train_files data/train.csv \
--dev_files data/dev.csv \
--test_files data/test.csv \
--decoder_library_path models/language/libctc_decoder_with_kenlm.so \
--lm_binary_path models/language/5gram.klm \
--lm_trie_path models/language/trie \
--alphabet_config_path models/language/alphabet.txt \
--train_batch_size 64 \
--dev_batch_size 64 \
--test_batch_size 64 \
--n_hidden 2048 \
--epoch 150 \
--checkpoint_dir models/session/ \
--summary_dir models/summary/ \
--summary_secs 1756 \
--export_dir models/modelo/ \
--validation_step 20 \
--earlystop_nsteps 4 \
--display_step 20

lissyx · September 7, 2018, 9:50am

Thanks @david0861. So, as I said on GitHub, your first experiment’s result is really strange: there’s no reason your models/output_graph_0a8.pb file would contain BlockLSTM unless it’s based not on v0.2.0-alpha.8 but on v0.2.0-alpha.9.

You should also not need to rebuild TensorFlow for training; it would be better to use upstream (we do).
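One quick way to test that suspicion without loading TensorFlow is to look for the op name in the exported file. This is a crude heuristic, sketched under the assumption that op type names are stored as plain byte strings in the serialized graph; a match could also come from a node name that happens to contain the same text. The helper name is hypothetical.

```python
def file_contains_bytes(path, needle, chunk_size=1 << 20):
    """Stream a file and report whether the byte string appears anywhere in it."""
    overlap = len(needle) - 1
    tail = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return False
            # Keep the end of the previous window so matches that straddle
            # a chunk boundary are still found.
            window = tail + chunk
            if needle in window:
                return True
            tail = window[-overlap:] if overlap > 0 else b""

# Example call (path taken from the first experiment):
# print(file_contains_bytes("models/output_graph_0a8.pb", b"BlockLSTM"))
```

If the check returns True for the file that is supposed to come from v0.2.0-alpha.8, the export was most likely produced by a different checkout.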

lissyx · September 7, 2018, 9:51am

Well, you should share the whole training output and give figures for your “low WER”, as well as document your training, dev and test datasets.

lissyx · September 7, 2018, 9:51am

FYI, @reuben is currently working on proper training on English with this new model, we are still exploring the proper hyperparameters required.

david0861 · September 7, 2018, 9:54am

Thanks @lissyx. I’m going to check to make sure that I’m really using v0.2.0-alpha.8, and I’ll share my documentation about the training.

lissyx · September 7, 2018, 9:58am

I really insist on that, because we are also still training a first version of this new model, so we don’t have a lot of knowledge yet.

david0861 · September 17, 2018, 2:10pm

@lissyx Well, I had a problem with my virtual environment, so I wasn’t actually working with v0.2.0-alpha.8. Now, when I try to train with v0.2.0-alpha.8, I get the following error:

(deepspeech2_env) $ CUDA_VISIBLE_DEVICES=0,1 ./DeepSpeech.py --train_files data/train.csv --dev_files data/dev.csv --test_files data/test.csv --decoder_library_path /models/language/libctc_decoder_with_kenlm.so --lm_binary_path models/language/5gram.klm --lm_trie_path /models/language/trie --alphabet_config_path models/language/alphabet.txt --train_batch_size 64 --dev_batch_size 64 --test_batch_size 64 --n_hidden 2048 --epoch 20 --checkpoint_dir /models/session/ --summary_dir models/summary/ --summary_secs 1756 --export_dir models/modelo/ --validation_step 25
I STARTING Optimization
E OOM when allocating tensor with shape[6144,8192] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
E 	 [[Node: tower_0/gradients/tower_0/bidirectional_rnn/bw/bw/while/basic_lstm_cell/MatMul_grad/MatMul_1 = MatMul[T=DT_FLOAT, transpose_a=true, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](tower_0/gradients/tower_0/bidirectional_rnn/bw/bw/while/basic_lstm_cell/MatMul_grad/MatMul_1/StackPopV2, tower_0/gradients/tower_0/bidirectional_rnn/bw/bw/while/basic_lstm_cell/BiasAdd_grad/tuple/control_dependency)]]
E Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
E 
E 	 [[Node: tower_0/gradients/tower_0/MatMul_1_grad/tuple/control_dependency_1/_1283 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_4918_tower_0/gradients/tower_0/MatMul_1_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
E Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
E 
E 
E Caused by op 'tower_0/gradients/tower_0/bidirectional_rnn/bw/bw/while/basic_lstm_cell/MatMul_grad/MatMul_1', defined at:
E   File "./DeepSpeech.py", line 1870, in <module>
E     tf.app.run()
E   File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 126, in run
E     _sys.exit(main(argv))
E   File "./DeepSpeech.py", line 1827, in main
E     train()
E   File "./DeepSpeech.py", line 1500, in train
E     results_tuple, gradients, mean_edit_distance, loss = get_tower_results(model_feeder, optimizer)
E   File "./DeepSpeech.py", line 653, in get_tower_results
E     gradients = optimizer.compute_gradients(avg_loss)
E   File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/training/optimizer.py", line 460, in compute_gradients
E     colocate_gradients_with_ops=colocate_gradients_with_ops)
E   File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/ops/gradients_impl.py", line 611, in gradients
E     lambda: grad_fn(op, *out_grads))
E   File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/ops/gradients_impl.py", line 377, in _MaybeCompile
E     return grad_fn()  # Exit early
E   File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/ops/gradients_impl.py", line 611, in <lambda>
E     lambda: grad_fn(op, *out_grads))
E   File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/ops/math_grad.py", line 973, in _MatMulGrad
E     grad_b = math_ops.matmul(a, grad, transpose_a=True)
E   File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 2064, in matmul
E     a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
E   File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/ops/gen_math_ops.py", line 2507, in _mat_mul
E     name=name)
E   File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
E     op_def=op_def)
E   File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
E     op_def=op_def)
E   File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
E     self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access
E 
E ...which was originally created as op 'tower_0/bidirectional_rnn/bw/bw/while/basic_lstm_cell/MatMul', defined at:
E   File "./DeepSpeech.py", line 1870, in <module>
E     tf.app.run()
E [elided 2 identical lines from previous traceback]
E   File "./DeepSpeech.py", line 1500, in train
E     results_tuple, gradients, mean_edit_distance, loss = get_tower_results(model_feeder, optimizer)
E   File "./DeepSpeech.py", line 635, in get_tower_results
E     calculate_mean_edit_distance_and_loss(model_feeder, i, dropout_rates)
E   File "./DeepSpeech.py", line 516, in calculate_mean_edit_distance_and_loss
E     logits = BiRNN(batch_x, tf.to_int64(batch_seq_len), dropout)
E   File "./DeepSpeech.py", line 453, in BiRNN
E     sequence_length=seq_length)
E   File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 442, in bidirectional_dynamic_rnn
E     time_major=time_major, scope=bw_scope)
E   File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 632, in dynamic_rnn
E     dtype=dtype)
E   File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 829, in _dynamic_rnn_loop
E     swap_memory=swap_memory)
E   File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3096, in while_loop
E     result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
E   File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2874, in BuildLoop
E     pred, body, original_loop_vars, loop_vars, shape_invariants)
E   File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2814, in _BuildLoop
E     body_result = body(*packed_vars_for_body)
E   File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3075, in <lambda>
E     body = lambda i, lv: (i + 1, orig_body(*lv))
E   File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 798, in _time_step
E     skip_conditionals=True)
E 
E ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[6144,8192] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
E 	 [[Node: tower_0/gradients/tower_0/bidirectional_rnn/bw/bw/while/basic_lstm_cell/MatMul_grad/MatMul_1 = MatMul[T=DT_FLOAT, transpose_a=true, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](tower_0/gradients/tower_0/bidirectional_rnn/bw/bw/while/basic_lstm_cell/MatMul_grad/MatMul_1/StackPopV2, tower_0/gradients/tower_0/bidirectional_rnn/bw/bw/while/basic_lstm_cell/BiasAdd_grad/tuple/control_dependency)]]
E Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
E 
E 	 [[Node: tower_0/gradients/tower_0/MatMul_1_grad/tuple/control_dependency_1/_1283 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_4918_tower_0/gradients/tower_0/MatMul_1_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
E Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
E 
E 
Traceback (most recent call last):
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1361, in _do_call
    return fn(*args)
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
    target_list, status, run_metadata)
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[6144,8192] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[Node: tower_0/gradients/tower_0/bidirectional_rnn/bw/bw/while/basic_lstm_cell/MatMul_grad/MatMul_1 = MatMul[T=DT_FLOAT, transpose_a=true, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](tower_0/gradients/tower_0/bidirectional_rnn/bw/bw/while/basic_lstm_cell/MatMul_grad/MatMul_1/StackPopV2, tower_0/gradients/tower_0/bidirectional_rnn/bw/bw/while/basic_lstm_cell/BiasAdd_grad/tuple/control_dependency)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[Node: tower_0/gradients/tower_0/MatMul_1_grad/tuple/control_dependency_1/_1283 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_4918_tower_0/gradients/tower_0/MatMul_1_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./DeepSpeech.py", line 1666, in train
    _, current_step, batch_loss, batch_report, step_summary = session.run([train_op, global_step, loss, report_params, step_summaries_op], **extra_params)
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 546, in run
    run_metadata=run_metadata)
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1022, in run
    run_metadata=run_metadata)
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1113, in run
    raise six.reraise(*original_exc_info)
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1098, in run
    return self._sess.run(*args, **kwargs)
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1170, in run
    run_metadata=run_metadata)
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 950, in run
    return self._sess.run(*args, **kwargs)
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[6144,8192] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[Node: tower_0/gradients/tower_0/bidirectional_rnn/bw/bw/while/basic_lstm_cell/MatMul_grad/MatMul_1 = MatMul[T=DT_FLOAT, transpose_a=true, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](tower_0/gradients/tower_0/bidirectional_rnn/bw/bw/while/basic_lstm_cell/MatMul_grad/MatMul_1/StackPopV2, tower_0/gradients/tower_0/bidirectional_rnn/bw/bw/while/basic_lstm_cell/BiasAdd_grad/tuple/control_dependency)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[Node: tower_0/gradients/tower_0/MatMul_1_grad/tuple/control_dependency_1/_1283 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_4918_tower_0/gradients/tower_0/MatMul_1_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


Caused by op 'tower_0/gradients/tower_0/bidirectional_rnn/bw/bw/while/basic_lstm_cell/MatMul_grad/MatMul_1', defined at:
  File "./DeepSpeech.py", line 1870, in <module>
    tf.app.run()
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "./DeepSpeech.py", line 1827, in main
    train()
  File "./DeepSpeech.py", line 1500, in train
    results_tuple, gradients, mean_edit_distance, loss = get_tower_results(model_feeder, optimizer)
  File "./DeepSpeech.py", line 653, in get_tower_results
    gradients = optimizer.compute_gradients(avg_loss)
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/training/optimizer.py", line 460, in compute_gradients
    colocate_gradients_with_ops=colocate_gradients_with_ops)
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/ops/gradients_impl.py", line 611, in gradients
    lambda: grad_fn(op, *out_grads))
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/ops/gradients_impl.py", line 377, in _MaybeCompile
    return grad_fn()  # Exit early
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/ops/gradients_impl.py", line 611, in <lambda>
    lambda: grad_fn(op, *out_grads))
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/ops/math_grad.py", line 973, in _MatMulGrad
    grad_b = math_ops.matmul(a, grad, transpose_a=True)
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 2064, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/ops/gen_math_ops.py", line 2507, in _mat_mul
    name=name)
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
    op_def=op_def)
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

...which was originally created as op 'tower_0/bidirectional_rnn/bw/bw/while/basic_lstm_cell/MatMul', defined at:
  File "./DeepSpeech.py", line 1870, in <module>
    tf.app.run()
[elided 2 identical lines from previous traceback]
  File "./DeepSpeech.py", line 1500, in train
    results_tuple, gradients, mean_edit_distance, loss = get_tower_results(model_feeder, optimizer)
  File "./DeepSpeech.py", line 635, in get_tower_results
    calculate_mean_edit_distance_and_loss(model_feeder, i, dropout_rates)
  File "./DeepSpeech.py", line 516, in calculate_mean_edit_distance_and_loss
    logits = BiRNN(batch_x, tf.to_int64(batch_seq_len), dropout)
  File "./DeepSpeech.py", line 453, in BiRNN
    sequence_length=seq_length)
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 442, in bidirectional_dynamic_rnn
    time_major=time_major, scope=bw_scope)
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 632, in dynamic_rnn
    dtype=dtype)
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 829, in _dynamic_rnn_loop
    swap_memory=swap_memory)
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3096, in while_loop
    result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2874, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2814, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3075, in <lambda>
    body = lambda i, lv: (i + 1, orig_body(*lv))
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 798, in _time_step
    skip_conditionals=True)

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[6144,8192] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[Node: tower_0/gradients/tower_0/bidirectional_rnn/bw/bw/while/basic_lstm_cell/MatMul_grad/MatMul_1 = MatMul[T=DT_FLOAT, transpose_a=true, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](tower_0/gradients/tower_0/bidirectional_rnn/bw/bw/while/basic_lstm_cell/MatMul_grad/MatMul_1/StackPopV2, tower_0/gradients/tower_0/bidirectional_rnn/bw/bw/while/basic_lstm_cell/BiasAdd_grad/tuple/control_dependency)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[Node: tower_0/gradients/tower_0/MatMul_1_grad/tuple/control_dependency_1/_1283 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_4918_tower_0/gradients/tower_0/MatMul_1_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
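
For scale, the tensor named in the OOM message above can be sized by hand: shape [6144, 8192] with 4-byte floats comes to 192 MiB. The sketch below does that arithmetic. The reading of the shape as [LSTM input + state, 4 gates x n_hidden] with --n_hidden 2048 is an assumption that happens to match the numbers, not something stated in the log.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Bytes needed for a dense tensor; 4 bytes per element matches DT_FLOAT."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element

oom_shape = (6144, 8192)   # shape reported in the OOM message
size = tensor_bytes(oom_shape)
print("%d bytes = %.1f MiB" % (size, size / 2.0 ** 20))

n_hidden = 2048            # value passed as --n_hidden
# Assumed reading of the shape: [LSTM input + state, 4 gates * n_hidden]
print(oom_shape == (2 * n_hidden + n_hidden, 4 * n_hidden))
```

A single allocation of this size failing suggests the GPU memory was already nearly full when it was requested, so the totals (which grow with batch size and n_hidden) matter more than this one tensor.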


Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[Node: tower_0/gradients/tower_0/MatMul_1_grad/tuple/control_dependency_1/_1283 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_4918_tower_0/gradients/tower_0/MatMul_1_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.



During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./DeepSpeech.py", line 1870, in <module>
    tf.app.run()
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "./DeepSpeech.py", line 1827, in main
    train()
  File "./DeepSpeech.py", line 1698, in train
    hook.end(session)
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 463, in end
    self._save(session, last_step)
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 474, in _save
    self._get_saver().save(session, self._save_path, global_step=step)
  File "/home/user0/deepspeech2_env/lib/python3.5/site-packages/tensorflow/python/training/saver.py", line 1646, in save
    raise TypeError("'sess' must be a Session; %s" % sess)
TypeError: 'sess' must be a Session; <tensorflow.python.training.monitored_session.MonitoredSession object at 0x7fd90d07ef60>
[1]+  Terminado (killed)

lissyx · September 17, 2018, 2:42pm

You’re pushing too much data to your GPU, it’s OOM-ing.
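
For a sense of scale, the tensor named in the error (shape[6144,8192], type float) is already about 192 MiB on its own, assuming "float" here means 4-byte float32. A minimal sketch of that arithmetic follows; the helper name `tensor_bytes` is made up for illustration and is not part of DeepSpeech or TensorFlow:

```python
from functools import reduce
from operator import mul


def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor.

    bytes_per_element=4 is an assumption: it corresponds to float32.
    """
    return reduce(mul, shape, 1) * bytes_per_element


# Shape taken from the ResourceExhaustedError message above.
size = tensor_bytes([6144, 8192])
print(size / 2**20, "MiB")  # 192.0 MiB
```

That is a single allocation. A training step also keeps the activations needed for backpropagation, and those grow with the batch size, so lowering the batch size is the usual first thing to try.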

david0861 · September 19, 2018, 7:12am

Well, I am finally able to train and run inference with v0.2.0-alpha.8. My problems were mixing different versions of DeepSpeech and choosing a batch size too large for my GPU memory. Thanks for the support @lissyx

Now I have two questions:
1- What do you mean by upstream TensorFlow? (Sorry, the term "upstream" is not clear to me in Spanish.)
2- Do you recommend moving to the new release, DeepSpeech 0.2.0?

lissyx · September 19, 2018, 7:33am

The one from tensorflow.org, published on PyPI as the tensorflow package.

Of course
