Unexpected ResourceExhaustedError

I am trying to fine-tune from the checkpoint in the latest release using my own dataset. My WAV file is ~22 MB, which I assume is not extraordinarily large.
Here are my machine specs:

1x Intel Core i7-6850K (6 cores, 3.6 GHz), 96 GB RAM, 11 GB GTX 1080 Ti, 12 GB Titan Xp

I assume training should go well with this configuration; however, I keep getting a ResourceExhaustedError and I am confused as to why.

Here’s my training script:

python -u /auto/k1/shahdloo/Projs/DeepSpeech/DeepSpeech.py \
  --train_files /auto/k1/shahdloo/Projs/stories-nn/data/stories/train.csv \
  --dev_files /auto/k1/shahdloo/Projs/stories-nn/data/stories/train.csv \
  --test_files /auto/k1/shahdloo/Projs/stories-nn/data/stories/train.csv \
  --n_hidden 2048 \
  --train_batch_size 1 \
  --dev_batch_size 1 \
  --test_batch_size 1 \
  --epoch 3 \
  --limit_train 1 \
  --limit_dev 1 \
  --log_level 0 \
  --limit_test 1 \
  --learning_rate 0.0001 \
  --dropout_rate 0.2367 \
  --default_stddev 0.046875 \
  --checkpoint_step 1 \
  --validation_step 1 \
  --wer_log_pattern "GLOBAL LOG: logwer('${COMPUTE_ID}', '%s', '%s', %f)" \
  --export_dir /auto/data/shahdloo/DeepSpeech/model_export/ \
  --checkpoint_dir /auto/data/shahdloo/DeepSpeech/checkpoint/ \
  --decoder_library_path /auto/k1/shahdloo/Projs/stories-nn/native_client/libctc_decoder_with_kenlm.so \
  --alphabet_config_path /auto/k1/shahdloo/Projs/stories-nn/data/alphabet.txt \
  --lm_binary_path /auto/k1/shahdloo/Projs/stories-nn/models/lm.binary \
  --lm_trie_path /auto/k1/shahdloo/Projs/stories-nn/models/trie

and here’s the tail of the error I get:

2018-05-24 15:50:07.474741: I tensorflow/core/common_runtime/bfc_allocator.cc:671]      Summary of in-use Chunks by size: 
2018-05-24 15:50:07.474755: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 47 Chunks of size 256 totalling 11.8KiB
2018-05-24 15:50:07.474764: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 2 Chunks of size 1280 totalling 2.5KiB
2018-05-24 15:50:07.474772: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 516143 Chunks of size 8192 totalling 3.94GiB
2018-05-24 15:50:07.474779: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 12544 totalling 12.2KiB
2018-05-24 15:50:07.474788: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 15872 totalling 15.5KiB
2018-05-24 15:50:07.474795: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 70666 Chunks of size 16384 totalling 1.08GiB
2018-05-24 15:50:07.474803: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 23296 totalling 22.8KiB
2018-05-24 15:50:07.474810: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 63527 Chunks of size 24576 totalling 1.45GiB
2018-05-24 15:50:07.474818: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 25088 totalling 24.5KiB
2018-05-24 15:50:07.474825: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 996 Chunks of size 32768 totalling 31.12MiB
2018-05-24 15:50:07.474833: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 237568 totalling 232.0KiB
2018-05-24 15:50:07.474841: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 16777216 totalling 16.00MiB
2018-05-24 15:50:07.474848: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 2 Chunks of size 33554432 totalling 64.00MiB
2018-05-24 15:50:07.474856: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 69820160 totalling 66.58MiB
2018-05-24 15:50:07.474864: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 2 Chunks of size 72364032 totalling 138.02MiB
2018-05-24 15:50:07.474872: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 1 Chunks of size 144728064 totalling 138.02MiB
2018-05-24 15:50:07.474880: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 3 Chunks of size 201326592 totalling 576.00MiB
2018-05-24 15:50:07.474888: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 6 Chunks of size 289456128 totalling 1.62GiB
2018-05-24 15:50:07.474895: I tensorflow/core/common_runtime/bfc_allocator.cc:674] 2 Chunks of size 578912256 totalling 1.08GiB
2018-05-24 15:50:07.474903: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Sum Total of in-use chunks: 10.17GiB
2018-05-24 15:50:07.475010: I tensorflow/core/common_runtime/bfc_allocator.cc:680] Stats: 
Limit:                 10921944679
InUse:                 10921944064
MaxInUse:              10921944320
NumAllocs:                  986278
MaxAllocSize:            578912256

2018-05-24 15:50:07.495472: W tensorflow/core/common_runtime/bfc_allocator.cc:279] ****************************************************************************************************
2018-05-24 15:50:07.495523: W tensorflow/core/framework/op_kernel.cc:1202] OP_REQUIRES failed at cwise_ops_common.cc:70 : Resource exhausted: OOM when allocating tensor with shape[1,2048] and type float on /job:localhost/replica:0/task:0/device:GPU:1 by allocator GPU_1_bfc
E OOM when allocating tensor with shape[1,2048] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
E 	 [[Node: tower_0/bidirectional_rnn/fw/fw/while/basic_lstm_cell/split = Split[T=DT_FLOAT, num_split=4, _device="/job:localhost/replica:0/task:0/device:GPU:0"](tower_0/gradients/Add/y, tower_0/bidirectional_rnn/fw/fw/while/basic_lstm_cell/BiasAdd)]]
E Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
E 
E 	 [[Node: tower_1/gradients/tower_1/MatMul_1_grad/tuple/control_dependency_1/_567 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device_incarnation=1, tensor_name="edge_2276_tower_1/gradients/tower_1/MatMul_1_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
E Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
E 
E 
E Caused by op 'tower_0/bidirectional_rnn/fw/fw/while/basic_lstm_cell/split', defined at:
E   File "/auto/k1/shahdloo/Projs/DeepSpeech/DeepSpeech.py", line 1838, in <module>
E     tf.app.run()
E   File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 126, in run
E     _sys.exit(main(argv))
E   File "/auto/k1/shahdloo/Projs/DeepSpeech/DeepSpeech.py", line 1795, in main
E     train()
E   File "/auto/k1/shahdloo/Projs/DeepSpeech/DeepSpeech.py", line 1501, in train
E     results_tuple, gradients, mean_edit_distance, loss = get_tower_results(model_feeder, optimizer)
E   File "/auto/k1/shahdloo/Projs/DeepSpeech/DeepSpeech.py", line 640, in get_tower_results
E     calculate_mean_edit_distance_and_loss(model_feeder, i, no_dropout if optimizer is None else dropout_rates)
E   File "/auto/k1/shahdloo/Projs/DeepSpeech/DeepSpeech.py", line 521, in calculate_mean_edit_distance_and_loss
E     logits = BiRNN(batch_x, tf.to_int64(batch_seq_len), dropout)
E   File "/auto/k1/shahdloo/Projs/DeepSpeech/DeepSpeech.py", line 458, in BiRNN
E     sequence_length=seq_length)
E   File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 416, in bidirectional_dynamic_rnn
E     time_major=time_major, scope=fw_scope)
E   File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 632, in dynamic_rnn
E     dtype=dtype)
E   File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 829, in _dynamic_rnn_loop
E     swap_memory=swap_memory)
E   File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3096, in while_loop
E     result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
E   File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2874, in BuildLoop
E     pred, body, original_loop_vars, loop_vars, shape_invariants)
E   File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2814, in _BuildLoop
E     body_result = body(*packed_vars_for_body)
E   File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3075, in <lambda>
E     body = lambda i, lv: (i + 1, orig_body(*lv))
E   File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 798, in _time_step
E     skip_conditionals=True)
E   File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 249, in _rnn_step
E     new_output, new_state = call_cell()
E   File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 786, in <lambda>
E     call_cell = lambda: cell(input_t, state)
E   File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/rnn_cell_impl.py", line 1056, in __call__
E     output, new_state = self._cell(inputs, state, scope=scope)
E   File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/rnn_cell_impl.py", line 296, in __call__
E     *args, **kwargs)
E   File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/layers/base.py", line 696, in __call__
E     outputs = self.call(inputs, *args, **kwargs)
E   File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/rnn_cell_impl.py", line 582, in call
E     value=gate_inputs, num_or_size_splits=4, axis=one)
E   File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/array_ops.py", line 1366, in split
E     axis=axis, num_split=num_or_size_splits, value=value, name=name)
E   File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 5069, in _split
E     "Split", split_dim=axis, value=value, num_split=num_split, name=name)
E   File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
E     op_def=op_def)
E   File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
E     op_def=op_def)
E   File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
E     self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access
E 
E ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,2048] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[Node: tower_0/bidirectional_rnn/fw/fw/while/basic_lstm_cell/split = Split[T=DT_FLOAT, num_split=4, _device="/job:localhost/replica:0/task:0/device:GPU:0"](tower_0/gradients/Add/y, tower_0/bidirectional_rnn/fw/fw/while/basic_lstm_cell/BiasAdd)]]
E Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
E 
E 	 [[Node: tower_1/gradients/tower_1/MatMul_1_grad/tuple/control_dependency_1/_567 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device_incarnation=1, tensor_name="edge_2276_tower_1/gradients/tower_1/MatMul_1_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
E Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
E 
E 
Traceback (most recent call last):
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1361, in _do_call
    return fn(*args)
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1340, in _run_fn
    target_list, status, run_metadata)
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,2048] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[Node: tower_0/bidirectional_rnn/fw/fw/while/basic_lstm_cell/split = Split[T=DT_FLOAT, num_split=4, _device="/job:localhost/replica:0/task:0/device:GPU:0"](tower_0/gradients/Add/y, tower_0/bidirectional_rnn/fw/fw/while/basic_lstm_cell/BiasAdd)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[Node: tower_1/gradients/tower_1/MatMul_1_grad/tuple/control_dependency_1/_567 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device_incarnation=1, tensor_name="edge_2276_tower_1/gradients/tower_1/MatMul_1_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/auto/k1/shahdloo/Projs/DeepSpeech/DeepSpeech.py", line 1595, in train
    step = session.run(global_step, feed_dict=feed_dict)
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 546, in run
    run_metadata=run_metadata)
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1022, in run
    run_metadata=run_metadata)
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1113, in run
    raise six.reraise(*original_exc_info)
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/six.py", line 693, in reraise
    raise value
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1098, in run
    return self._sess.run(*args, **kwargs)
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 1170, in run
    run_metadata=run_metadata)
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/training/monitored_session.py", line 950, in run
    return self._sess.run(*args, **kwargs)
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 905, in run
    run_metadata_ptr)
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1137, in _run
    feed_dict_tensor, options, run_metadata)
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1355, in _do_run
    options, run_metadata)
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1374, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1,2048] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[Node: tower_0/bidirectional_rnn/fw/fw/while/basic_lstm_cell/split = Split[T=DT_FLOAT, num_split=4, _device="/job:localhost/replica:0/task:0/device:GPU:0"](tower_0/gradients/Add/y, tower_0/bidirectional_rnn/fw/fw/while/basic_lstm_cell/BiasAdd)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[Node: tower_1/gradients/tower_1/MatMul_1_grad/tuple/control_dependency_1/_567 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device_incarnation=1, tensor_name="edge_2276_tower_1/gradients/tower_1/MatMul_1_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


Caused by op 'tower_0/bidirectional_rnn/fw/fw/while/basic_lstm_cell/split', defined at:
  File "/auto/k1/shahdloo/Projs/DeepSpeech/DeepSpeech.py", line 1838, in <module>
    tf.app.run()
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "/auto/k1/shahdloo/Projs/DeepSpeech/DeepSpeech.py", line 1795, in main
    train()
  File "/auto/k1/shahdloo/Projs/DeepSpeech/DeepSpeech.py", line 1501, in train
    results_tuple, gradients, mean_edit_distance, loss = get_tower_results(model_feeder, optimizer)
  File "/auto/k1/shahdloo/Projs/DeepSpeech/DeepSpeech.py", line 640, in get_tower_results
    calculate_mean_edit_distance_and_loss(model_feeder, i, no_dropout if optimizer is None else dropout_rates)
  File "/auto/k1/shahdloo/Projs/DeepSpeech/DeepSpeech.py", line 521, in calculate_mean_edit_distance_and_loss
    logits = BiRNN(batch_x, tf.to_int64(batch_seq_len), dropout)
  File "/auto/k1/shahdloo/Projs/DeepSpeech/DeepSpeech.py", line 458, in BiRNN
    sequence_length=seq_length)
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 416, in bidirectional_dynamic_rnn
    time_major=time_major, scope=fw_scope)
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 632, in dynamic_rnn
    dtype=dtype)
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 829, in _dynamic_rnn_loop
    swap_memory=swap_memory)
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3096, in while_loop
    result = loop_context.BuildLoop(cond, body, loop_vars, shape_invariants)
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2874, in BuildLoop
    pred, body, original_loop_vars, loop_vars, shape_invariants)
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 2814, in _BuildLoop
    body_result = body(*packed_vars_for_body)
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/control_flow_ops.py", line 3075, in <lambda>
    body = lambda i, lv: (i + 1, orig_body(*lv))
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 798, in _time_step
    skip_conditionals=True)
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 249, in _rnn_step
    new_output, new_state = call_cell()
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/rnn.py", line 786, in <lambda>
    call_cell = lambda: cell(input_t, state)
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/rnn_cell_impl.py", line 1056, in __call__
    output, new_state = self._cell(inputs, state, scope=scope)
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/rnn_cell_impl.py", line 296, in __call__
    *args, **kwargs)
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/layers/base.py", line 696, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/rnn_cell_impl.py", line 582, in call
    value=gate_inputs, num_or_size_splits=4, axis=one)
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/array_ops.py", line 1366, in split
    axis=axis, num_split=num_or_size_splits, value=value, name=name)
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 5069, in _split
    "Split", split_dim=axis, value=value, num_split=num_split, name=name)
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
    op_def=op_def)
  File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1,2048] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[Node: tower_0/bidirectional_rnn/fw/fw/while/basic_lstm_cell/split = Split[T=DT_FLOAT, num_split=4, _device="/job:localhost/replica:0/task:0/device:GPU:0"](tower_0/gradients/Add/y, tower_0/bidirectional_rnn/fw/fw/while/basic_lstm_cell/BiasAdd)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[Node: tower_1/gradients/tower_1/MatMul_1_grad/tuple/control_dependency_1/_567 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:1", send_device_incarnation=1, tensor_name="edge_2276_tower_1/gradients/tower_1/MatMul_1_grad/tuple/control_dependency_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


D Closing queues...
2018-05-24 15:52:11.719445: W tensorflow/core/kernels/queue_base.cc:277] _0_padding_fifo_queue_5: Skipping cancelled enqueue attempt with queue not closed
2018-05-24 15:52:11.719571: W tensorflow/core/kernels/queue_base.cc:277] _4_padding_fifo_queue_1: Skipping cancelled enqueue attempt with queue not closed
2018-05-24 15:52:11.719639: W tensorflow/core/kernels/queue_base.cc:277] _0_padding_fifo_queue_5: Skipping cancelled enqueue attempt with queue not closed
2018-05-24 15:52:11.719690: W tensorflow/core/kernels/queue_base.cc:277] _2_padding_fifo_queue_3: Skipping cancelled enqueue attempt with queue not closed
2018-05-24 15:52:11.719709: W tensorflow/core/kernels/queue_base.cc:277] _2_padding_fifo_queue_3: Skipping cancelled enqueue attempt with queue not closed
2018-05-24 15:52:11.719778: W tensorflow/core/kernels/queue_base.cc:277] _4_padding_fifo_queue_1: Skipping cancelled enqueue attempt with queue not closed
2018-05-24 15:52:11.719833: W tensorflow/core/kernels/queue_base.cc:277] _3_padding_fifo_queue_2: Skipping cancelled enqueue attempt with queue not closed
2018-05-24 15:52:11.719877: W tensorflow/core/kernels/queue_base.cc:277] _5_padding_fifo_queue_4: Skipping cancelled enqueue attempt with queue not closed
2018-05-24 15:52:11.719894: W tensorflow/core/kernels/queue_base.cc:277] _5_padding_fifo_queue_4: Skipping cancelled enqueue attempt with queue not closed
2018-05-24 15:52:11.719933: W tensorflow/core/kernels/queue_base.cc:277] _3_padding_fifo_queue_2: Skipping cancelled enqueue attempt with queue not closed
E You must feed a value for placeholder tensor 'Queue_Selector' with dtype int32
E 	 [[Node: Queue_Selector = Placeholder[dtype=DT_INT32, shape=<unknown>, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
E 
E Caused by op 'Queue_Selector', defined at:
E   File "/auto/k1/shahdloo/Projs/DeepSpeech/DeepSpeech.py", line 1838, in <module>
E     tf.app.run()
E   File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/platform/app.py", line 126, in run
E     _sys.exit(main(argv))
E   File "/auto/k1/shahdloo/Projs/DeepSpeech/DeepSpeech.py", line 1795, in main
E     train()
E   File "/auto/k1/shahdloo/Projs/DeepSpeech/DeepSpeech.py", line 1489, in train
E     tower_feeder_count=len(available_devices))
E   File "/auto/k1/shahdloo/Projs/DeepSpeech/util/feeding.py", line 43, in __init__
E     self.ph_queue_selector = tf.placeholder(tf.int32, name='Queue_Selector')
E   File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/array_ops.py", line 1746, in placeholder
E     return gen_array_ops._placeholder(dtype=dtype, shape=shape, name=name)
E   File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/ops/gen_array_ops.py", line 3051, in _placeholder
E     "Placeholder", dtype=dtype, shape=shape, name=name)
E   File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
E     op_def=op_def)
E   File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3271, in create_op
E     op_def=op_def)
E   File "/auto/k1/shahdloo/Projs/stories-nn/venv/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1650, in __init__
E     self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access
E 
E InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'Queue_Selector' with dtype int32
E 	 [[Node: Queue_Selector = Placeholder[dtype=DT_INT32, shape=<unknown>, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]
E 
E The checkpoint in /auto/data/shahdloo/DeepSpeech/checkpoint/ does not match the shapes of the model. Did you change alphabet.txt or the --n_hidden parameter between train runs using the same checkpoint dir? Try moving or removing the contents of /auto/data/shahdloo/DeepSpeech/checkpoint/.

One side note: at the very end, it complains about my n_hidden or alphabet, even though both match what was used to train the model in the latest release, as far as I can tell…

Thanks in advance for your thoughts

Looks like you’re out of GPU memory. You already have a batch size of 1, so either something else is using your GPU’s memory, or your WAV file is just too big?
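The log itself hints at `report_tensor_allocations_upon_oom`, which lists the live tensors when the OOM happens. I can’t say exactly where to hook it into DeepSpeech.py, but the generic TF 1.x pattern it refers to looks roughly like this (sketch only):

```python
import tensorflow as tf

# Ask TF to dump the allocated tensors if a run OOMs.
run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)

# Pass it to the session.run call that is failing, e.g. (placeholder names):
# step = session.run(global_step, feed_dict=feed_dict, options=run_options)
```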

Thanks for the quick reply.
That’s exactly what confuses me. The GPUs are not used by any other process, so there is about 21 GB free when I start. What are the typical sizes of the WAV files used for training? Could I break mine into chunks and feed them in separately?

I don’t know offhand the sizes of the files we use for training, but you could verify quickly by using the LDC93S1 sample that is in the repo.
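If memory serves, the repo ships a tiny one-sentence LDC93S1 sample with an import script; something along these lines should run it (paths are from memory and may differ in your checkout):

```bash
# Fetch/prepare the sample, then point the training flags at its CSV.
python -u bin/import_ldc93s1.py ./data/ldc93s1
python -u DeepSpeech.py \
  --train_files data/ldc93s1/ldc93s1.csv \
  --dev_files data/ldc93s1/ldc93s1.csv \
  --test_files data/ldc93s1/ldc93s1.csv \
  --train_batch_size 1 --dev_batch_size 1 --test_batch_size 1 \
  --n_hidden 2048 --epoch 1
```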

Besides, you don’t have 21 GB; one batch needs to fit on a single GPU. Can you share nvidia-smi output?
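To capture it while training is actually running, something like this works (just a suggestion; the log file name is a placeholder):

```bash
# Log per-GPU memory usage once per second while the training script runs.
nvidia-smi --query-gpu=timestamp,index,name,memory.used,memory.total \
           --format=csv -l 1 > gpu_mem.log
```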

Here’s the output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.26                 Driver Version: 396.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:02:00.0 Off |                  N/A |
| 23%   26C    P8    16W / 250W |      0MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:03:00.0 Off |                  N/A |
|  0%   21C    P8    15W / 280W |      0MiB / 11177MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Well, the sample data is not realistic since it’s just one sentence of text.

The purpose is to make sure it’s at least working :slight_smile:

Can you share details about your TensorFlow version, and the audio file you use?

I just ran it on the sample data and it finished training successfully


Mono, 16 kHz, and 16-bit PCM? How long does that make it, at 22 MB?

It is 11 minutes of sound. I checked again: it’s mono and 16 kHz. I’m not sure how to check the 16-bit part you mentioned. ffmpeg gives this:

Duration: 00:11:46.69, bitrate: 256 kb/s
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, 1 channels, s16, 256 kb/s

Is the 256 kb/s bitrate problematic?

Nope, that looks fine; pcm_s16le is 16-bit signed PCM, so the format checks out. 11 minutes is a bit long, but I don’t think it should be a problem. Trivially, you can split it (see the sketch below), but I’m still a bit puzzled as to why it fails.
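For the record, a quick way to chop it up without re-encoding (the 15 s figure and file names are just placeholders); keep in mind each resulting chunk needs its own transcript row in your train.csv:

```bash
# Cut the long WAV into ~15-second pieces, keeping the PCM data as-is.
ffmpeg -i story.wav -f segment -segment_time 15 -c copy story_%03d.wav
```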

Thanks. I’ll try the splitting then…

@Tilman_Kamp @kdavis Do you remember what the biggest file we use during training is?

I don’t remember, but 11 minutes sounds really long.

I’d guess the longest we train on is on the order of 20-30 seconds. But that’s a guess.