I tried a batch size of 8 and it still fails - it also fails at a similar point with batch sizes of 16 and 24, always near the end of the epoch.
Epoch 0 | Training | Elapsed Time: 1:50:51 | Steps: 3471 | Loss: inf
E The following files caused an infinite (or NaN) loss: /home/anon/Downloads/jaSTTDatasets/processedAudio/18311.wav,/home/anon/Downloads/jaSTTDatasets/processedAudio/14902.wav,/home/anon/Downloads/jaSTTDatasets/processedAudio/13702.wav
Epoch 0 | Training | Elapsed Time: 1:52:00 | Steps: 3482 | Loss: inf
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[13384,2048] and type bool on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
    [[{{node tower_0/dropout_3/GreaterEqual}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
    [[concat/concat/_119]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: OOM when allocating tensor with shape[13384,2048] and type bool on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
    [[{{node tower_0/dropout_3/GreaterEqual}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations. 0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 982, in run_script
    absl.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/DeepSpeech/training/deepspeech_training/train.py", line 954, in main
    train()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 607, in train
    train_loss, _ = run_set('train', epoch, train_init_op)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 572, in run_set
    feed_dict=feed_dict)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[13384,2048] and type bool on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
    [[node tower_0/dropout_3/GreaterEqual (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
    [[concat/concat/_119]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
(1) Resource exhausted: OOM when allocating tensor with shape[13384,2048] and type bool on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
    [[node tower_0/dropout_3/GreaterEqual (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
0 successful operations. 0 derived errors ignored.

Original stack trace for 'tower_0/dropout_3/GreaterEqual':
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 982, in run_script
    absl.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/DeepSpeech/training/deepspeech_training/train.py", line 954, in main
    train()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 484, in train
    gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 317, in get_tower_results
    avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 244, in calculate_mean_edit_distance_and_loss
    logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 204, in create_model
    layers['layer_5'] = layer_5 = dense('layer_5', output, Config.n_hidden_5, dropout_rate=dropout[5], layer_norm=FLAGS.layer_norm)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 93, in dense
    output = tf.nn.dropout(output, rate=dropout_rate)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 4229, in dropout
    return dropout_v2(x, rate, noise_shape=noise_shape, seed=seed, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 4313, in dropout_v2
    keep_mask = random_tensor >= rate
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_math_ops.py", line 4481, in greater_equal
    "GreaterEqual", x=x, y=y, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()
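Side note: the run also flagged three files with an infinite (or NaN) loss. Here is a minimal sketch of how I plan to drop those rows from the training manifest before the next run, assuming the standard DeepSpeech CSV columns (wav_filename, wav_filesize, transcript); the train.csv / train_filtered.csv paths are placeholders for my setup:

```python
import csv

# Files the trainer reported as producing an inf/NaN loss (from the log above).
BAD_FILES = {
    "/home/anon/Downloads/jaSTTDatasets/processedAudio/18311.wav",
    "/home/anon/Downloads/jaSTTDatasets/processedAudio/14902.wav",
    "/home/anon/Downloads/jaSTTDatasets/processedAudio/13702.wav",
}

# Copy every row except the flagged files into a new manifest.
with open("train.csv", newline="", encoding="utf-8") as src, \
     open("train_filtered.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        if row["wav_filename"] not in BAD_FILES:
            writer.writerow(row)
```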
I have excluded files larger than 2 MB. It shouldn't be possible for 8 × 2 MB = 16 MB of audio to push a 4 GB GPU out of memory - correct me if there is some behaviour I am unaware of. Most files are around 250 KB.
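To sanity-check my own assumption: the OOM tensor shape [13384, 2048] looks like a flattened (batch × time-frames, n_hidden) activation, which would mean memory scales with audio duration rather than on-disk size. A back-of-the-envelope sketch - the 16 kHz / 16-bit mono format and 10 ms feature stride are assumptions about my preprocessing, not confirmed values:

```python
# Rough estimate of ONE flattened (batch*time, n_hidden) activation tensor.
# Assumes 16 kHz, 16-bit mono WAVs and one feature frame per 10 ms.
BYTES_PER_SECOND = 16_000 * 2   # 16 kHz * 2 bytes per sample
STRIDE_S = 0.010                # one feature frame every 10 ms
N_HIDDEN = 2048                 # width seen in the OOM tensor shape
BYTES_PER_FLOAT = 4             # float32

def activation_mib(file_bytes: int, batch_size: int) -> float:
    """Approximate size in MiB of a single dense layer's activations."""
    seconds = file_bytes / BYTES_PER_SECOND
    frames = seconds / STRIDE_S
    return batch_size * frames * N_HIDDEN * BYTES_PER_FLOAT / 2**20

# A batch of eight 2 MB files: one layer's activations alone.
print(f"{activation_mib(2 * 2**20, 8):.0f} MiB")  # ~410 MiB
```

If that estimate is anywhere near right, a single layer for a batch of eight 2 MB files is already ~410 MiB, before counting the other layers, the RNN state, or gradients - so the 16 MB-of-audio intuition may not hold.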
The fact that it OOMs towards the end is suspicious and points at some kind of memory leak.
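Before settling on the leak theory, one thing I can check is whether my manifest happens to order clips from short to long, which would also put the heaviest batches at the end of the epoch. A quick sketch, again assuming the DeepSpeech CSV layout and a placeholder train.csv path:

```python
import csv

# Compare file sizes (a proxy for duration) at the head vs. tail of the manifest.
with open("train.csv", newline="", encoding="utf-8") as f:
    sizes = [int(row["wav_filesize"]) for row in csv.DictReader(f)]

print("first 100 avg KB:", sum(sizes[:100]) / 100 / 1024)
print("last  100 avg KB:", sum(sizes[-100:]) / 100 / 1024)
```

If the tail average is much larger than the head average, data ordering alone could explain the late failure without any leak.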
Will retry with a batch size of 4 …