Help with Japanese model

I have been trying to create a Japanese model and have collected around 70 hours of audio.
While training the model in Docker, I receive these errors:

root@80dfb52cfddf:/DeepSpeech# python -u DeepSpeech.py \
>   --train_files /home/anon/Downloads/jaSTTDatasets/final-train.csv \
>   --train_batch_size 24 \
>   --dev_files /home/anon/Downloads/jaSTTDatasets/final-dev.csv \
>   --dev_batch_size 24 \
>   --test_files /home/anon/Downloads/jaSTTDatasets/final-test.csv \
>   --test_batch_size 24 \
>   --epochs 5 \
>   --bytes_output_mode \
>   --checkpoint_dir /home/anon/Downloads/jaSTTDatasets/checkpoint
I Could not find best validating checkpoint.
I Loading most recent checkpoint from /home/anon/Downloads/jaSTTDatasets/checkpoint/train-976
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam_1
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel/Adam_1
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/bias/Adam
I Loading variable from checkpoint: layer_6/bias/Adam_1
I Loading variable from checkpoint: layer_6/weights
I Loading variable from checkpoint: layer_6/weights/Adam
I Loading variable from checkpoint: layer_6/weights/Adam_1
I Loading variable from checkpoint: learning_rate
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:00:24 | Steps: 22 | Loss: 26.007061                         E The following files caused an infinite (or NaN) loss: /home/anon/Downloads/jaSTTDatasets/processedAudio/5350.wav
Epoch 0 |   Training | Elapsed Time: 0:00:26 | Steps: 24 | Loss: inf                               E The following files caused an infinite (or NaN) loss: /home/anon/Downloads/jaSTTDatasets/processedAudio/21545.wav
Epoch 0 |   Training | Elapsed Time: 0:00:48 | Steps: 46 | Loss: inf                               E The following files caused an infinite (or NaN) loss: /home/anon/Downloads/jaSTTDatasets/processedAudio/12658.wav
Epoch 0 |   Training | Elapsed Time: 0:00:55 | Steps: 53 | Loss: inf                               E The following files caused an infinite (or NaN) loss: /home/anon/Downloads/jaSTTDatasets/processedAudio/5370.wav
Epoch 0 |   Training | Elapsed Time: 0:00:56 | Steps: 54 | Loss: inf                               E The following files caused an infinite (or NaN) loss: /home/anon/Downloads/jaSTTDatasets/processedAudio/12779.wav
Epoch 0 |   Training | Elapsed Time: 0:01:29 | Steps: 83 | Loss: inf                               E The following files caused an infinite (or NaN) loss: /home/anon/Downloads/jaSTTDatasets/processedAudio/5369.wav
Epoch 0 |   Training | Elapsed Time: 0:01:34 | Steps: 87 | Loss: inf                               E The following files caused an infinite (or NaN) loss: /home/anon/Downloads/jaSTTDatasets/processedAudio/708.wav
Epoch 0 |   Training | Elapsed Time: 0:02:19 | Steps: 126 | Loss: inf                              E The following files caused an infinite (or NaN) loss: /home/anon/Downloads/jaSTTDatasets/processedAudio/804.wav,/home/anon/Downloads/jaSTTDatasets/processedAudio/787.wav,/home/anon/Downloads/jaSTTDatasets/processedAudio/926.wav
Epoch 0 |   Training | Elapsed Time: 0:02:25 | Steps: 131 | Loss: inf                              E The following files caused an infinite (or NaN) loss: /home/anon/Downloads/jaSTTDatasets/processedAudio/966.wav
Epoch 0 |   Training | Elapsed Time: 0:03:17 | Steps: 172 | Loss: inf                              E The following files caused an infinite (or NaN) loss: /home/anon/Downloads/jaSTTDatasets/processedAudio/1412.wav,/home/anon/Downloads/jaSTTDatasets/processedAudio/1009.wav
Epoch 0 |   Training | Elapsed Time: 0:04:20 | Steps: 219 | Loss: inf                              E The following files caused an infinite (or NaN) loss: /home/anon/Downloads/jaSTTDatasets/processedAudio/549.wav,/home/anon/Downloads/jaSTTDatasets/processedAudio/138.wav
Epoch 0 |   Training | Elapsed Time: 0:04:49 | Steps: 239 | Loss: inf                              E The following files caused an infinite (or NaN) loss: /home/anon/Downloads/jaSTTDatasets/processedAudio/1445.wav
Epoch 0 |   Training | Elapsed Time: 0:05:30 | Steps: 267 | Loss: inf                              E The following files caused an infinite (or NaN) loss: /home/anon/Downloads/jaSTTDatasets/processedAudio/575.wav,/home/anon/Downloads/jaSTTDatasets/processedAudio/583.wav
Epoch 0 |   Training | Elapsed Time: 0:05:35 | Steps: 271 | Loss: inf                              E The following files caused an infinite (or NaN) loss: /home/anon/Downloads/jaSTTDatasets/processedAudio/25882.wav
Epoch 0 |   Training | Elapsed Time: 0:05:37 | Steps: 272 | Loss: inf                              E The following files caused an infinite (or NaN) loss: /home/anon/Downloads/jaSTTDatasets/processedAudio/543.wav,/home/anon/Downloads/jaSTTDatasets/processedAudio/660.wav
Epoch 0 |   Training | Elapsed Time: 0:06:34 | Steps: 310 | Loss: inf                              E The following files caused an infinite (or NaN) loss: /home/anon/Downloads/jaSTTDatasets/processedAudio/16574.wav
Epoch 0 |   Training | Elapsed Time: 0:06:36 | Steps: 311 | Loss: inf                              E The following files caused an infinite (or NaN) loss: /home/anon/Downloads/jaSTTDatasets/processedAudio/123.wav,/home/anon/Downloads/jaSTTDatasets/processedAudio/23026.wav,/home/anon/Downloads/jaSTTDatasets/processedAudio/25289.wav
Epoch 0 |   Training | Elapsed Time: 0:06:39 | Steps: 313 | Loss: inf                              E The following files caused an infinite (or NaN) loss: /home/anon/Downloads/jaSTTDatasets/processedAudio/16154.wav
Epoch 0 |   Training | Elapsed Time: 0:06:40 | Steps: 314 | Loss: inf                              E The following files caused an infinite (or NaN) loss: /home/anon/Downloads/jaSTTDatasets/processedAudio/3614.wav

A similar issue has already been addressed in the topic "I am getting Validation Loss: inf", however the solution there states that we should just run training and read the console messages for corrupt files.
Is there any other way to easily filter out problematic files? The files seem to have a proper header and have some audio when opened in VLC. They are also mono, 16 kHz.
I have picked 2 of the problem files at random and attached them; please review them as well.

Since I have a small dataset, I would like to repair these files rather than delete them.
The current CSV has records for around 35,000 files.
problemFiles.zip (171.2 KB)
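
Something like the sketch below (Python's built-in wave module; the CSV path is just my local setup) is what I mean by a basic check, but since the problem files already look fine on these properties, I am hoping there is something more thorough.

# Rough sketch: sanity-check every WAV referenced in the training CSV.
# Assumes the DeepSpeech CSV layout (wav_filename, wav_filesize, transcript).
import csv
import wave

def check_wav(path):
    """Return a list of problems found in a single WAV file."""
    problems = []
    try:
        with wave.open(path, 'rb') as w:
            if w.getnchannels() != 1:
                problems.append('not mono')
            if w.getframerate() != 16000:
                problems.append('sample rate %d != 16000' % w.getframerate())
            if w.getsampwidth() != 2:
                problems.append('not 16-bit PCM')
            if w.getnframes() == 0:
                problems.append('zero-length audio')
    except Exception as e:  # truncated header, wrong container, etc.
        problems.append('unreadable: %s' % e)
    return problems

with open('/home/anon/Downloads/jaSTTDatasets/final-train.csv') as f:
    for row in csv.DictReader(f):
        issues = check_wav(row['wav_filename'])
        if issues:
            print(row['wav_filename'], ', '.join(issues))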

Thanks for opening a new thread.

I am not sure it is the files. They look fine. 70 hours is not much, but you should get results.

Have you done everything according to the docs? My guess is that it has to do with the bytes_output_mode or the learning rate. You could try a higher learning rate of 0.01 just for fun. This shouldn’t take more than a couple of minutes.

Do you have any idea @lissyx?

I tried a learning_rate of 0.01 instead of the default 0.001 and it had no effect.
The same files are causing infinite loss.

I am also encountering OOM errors; I assume this is caused by a batch not fitting into GPU memory.
I have removed all audio files longer than 1 minute from the set and reduced the batch size to 16. I also randomized the CSV file so that big files are not grouped together. It would be nice if it said which batch is causing the OOM error, but it doesn’t seem to.
The GPU is a GeForce GTX 1650.
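
For reference, this is roughly how I dropped the long clips and shuffled the CSV (a sketch; the 60-second cutoff and paths match what I described above, and the output filename is arbitrary):

# Sketch: drop clips longer than 60 s from the training CSV and shuffle the rest.
# Duration is taken from the WAV header rather than the file size.
import csv
import random
import wave

def duration_seconds(path):
    with wave.open(path, 'rb') as w:
        return w.getnframes() / float(w.getframerate())

src = '/home/anon/Downloads/jaSTTDatasets/final-train.csv'
dst = '/home/anon/Downloads/jaSTTDatasets/final-train-filtered.csv'

with open(src) as f:
    rows = list(csv.DictReader(f))

kept = [r for r in rows if duration_seconds(r['wav_filename']) <= 60.0]
random.shuffle(kept)  # note: the training pipeline may still re-order by file size internally

with open(dst, 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['wav_filename', 'wav_filesize', 'transcript'])
    writer.writeheader()
    writer.writerows(kept)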

But you did follow the docs? As you haven’t clicked the link, you might not … otherwise let’s wait for lissyx. But that might take a day or so.

Thank you for the reply,
Yes, I have read the docs. I re-read them just now to make sure I am not missing anything. I have not used a language model or augmentations since this is just a prototype. Once this works I will scrape more tedious data like the Bible etc. and also use a language model and augmentations.

Also, I was able to successfully get bytes_output_mode working on very small data (around 5 audio files, as is done in ./bin/run-ldc93s1.sh). I am just having trouble getting it to work on this larger dataset.

OK, different question. What is your error message? Or does it simply stop? In the above output I can only see the inf loss, which is not a problem in itself as training does continue. If it simply stops, maybe use a lower batch size for training and see whether 1 whole epoch runs through.

yes: make sure you pass clean data to deepspeech, we can’t handle that problem at our level.

sorry, we don’t have time for that.

4GB RAM, not a lot, don’t expect too much.

do you understand what it does?

I was able to recreate the OOM error.

This was the command used:

python -u DeepSpeech.py \
  --train_files /home/anon/Downloads/jaSTTDatasets/final-train.csv \
  --train_batch_size 16 \
  --dev_files /home/anon/Downloads/jaSTTDatasets/final-dev.csv \
  --dev_batch_size 16 \
  --test_files /home/anon/Downloads/jaSTTDatasets/final-test.csv \
  --test_batch_size 16 \
  --epochs 5 \
  --bytes_output_mode \
  --checkpoint_dir /home/anon/Downloads/jaSTTDatasets/checkpoint

These logs were generated; I could not fit the entire log due to the character limit on the website.
I have only included the logs from just before it started erroring out.

Epoch 0 |   Training | Elapsed Time: 0:57:44 | Steps: 1700 | Loss: inf                             E The following files caused an infinite (or NaN) loss: /home/anon/Downloads/jaSTTDatasets/processedAudio/26595.wav
Epoch 0 |   Training | Elapsed Time: 1:00:07 | Steps: 1732 | Loss: inf                             E The following files caused an infinite (or NaN) loss: /home/anon/Downloads/jaSTTDatasets/processedAudio/27100.wav,/home/anon/Downloads/jaSTTDatasets/processedAudio/20379.wav
Epoch 0 |   Training | Elapsed Time: 1:00:12 | Steps: 1733 | Loss: inf                             Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[12608,2048] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node tower_0/gradients/tower_0/MatMul_4_grad/MatMul}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[tower_0/gradients/tower_0/BiasAdd_1_grad/BiasAddGrad/_107]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[12608,2048] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node tower_0/gradients/tower_0/MatMul_4_grad/MatMul}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 982, in run_script
    absl.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/DeepSpeech/training/deepspeech_training/train.py", line 954, in main
    train()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 607, in train
    train_loss, _ = run_set('train', epoch, train_init_op)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 572, in run_set
    feed_dict=feed_dict)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[12608,2048] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node tower_0/gradients/tower_0/MatMul_4_grad/MatMul (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[tower_0/gradients/tower_0/BiasAdd_1_grad/BiasAddGrad/_107]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[12608,2048] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node tower_0/gradients/tower_0/MatMul_4_grad/MatMul (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Original stack trace for 'tower_0/gradients/tower_0/MatMul_4_grad/MatMul':
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 982, in run_script
    absl.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/DeepSpeech/training/deepspeech_training/train.py", line 954, in main
    train()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 484, in train
    gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 326, in get_tower_results
    gradients = optimizer.compute_gradients(avg_loss)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/optimizer.py", line 512, in compute_gradients
    colocate_gradients_with_ops=colocate_gradients_with_ops)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gradients_impl.py", line 158, in gradients
    unconnected_gradients)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gradients_util.py", line 679, in _GradientsHelper
    lambda: grad_fn(op, *out_grads))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gradients_util.py", line 350, in _MaybeCompile
    return grad_fn()  # Exit early
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gradients_util.py", line 679, in <lambda>
    lambda: grad_fn(op, *out_grads))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/math_grad.py", line 1585, in _MatMulGrad
    grad_a = gen_math_ops.mat_mul(grad, b, transpose_b=True)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_math_ops.py", line 6136, in mat_mul
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

...which was originally created as op 'tower_0/MatMul_4', defined at:
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
[elided 4 identical lines from previous traceback]
  File "/DeepSpeech/training/deepspeech_training/train.py", line 484, in train
    gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 317, in get_tower_results
    avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 244, in calculate_mean_edit_distance_and_loss
    logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 207, in create_model
    layers['layer_6'] = layer_6 = dense('layer_6', layer_5, Config.n_hidden_6, relu=False)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 83, in dense
    output = tf.nn.bias_add(tf.matmul(x, weights), bias)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/dispatch.py", line 180, in wrapper
    return target(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/math_ops.py", line 2754, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_math_ops.py", line 6136, in mat_mul
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

^CProcess ForkPoolWorker-7:
Process ForkPoolWorker-5:
Process ForkPoolWorker-8:
Process ForkPoolWorker-6:
Process ForkPoolWorker-4:
Process ForkPoolWorker-3:
Process ForkPoolWorker-2:
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/util.py", line 262, in _run_finalizers
Process ForkPoolWorker-1:
    finalizer()
  File "/usr/lib/python3.6/multiprocessing/util.py", line 186, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 571, in _terminate_pool
    cls._help_stuff_finish(inqueue, task_handler, len(pool))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 556, in _help_stuff_finish
    inqueue._rlock.acquire()
KeyboardInterrupt
Traceback (most recent call last):
Traceback (most recent call last):
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
    with self._rlock:
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
    with self._rlock:
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
KeyboardInterrupt
KeyboardInterrupt
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
    with self._rlock:
  File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
Traceback (most recent call last):
KeyboardInterrupt
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
    with self._rlock:
  File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 335, in get
    res = self._reader.recv_bytes()
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
    with self._rlock:
  File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
KeyboardInterrupt
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
    with self._rlock:
  File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
    with self._rlock:
  File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
^Croot@80dfb52cfddf:/DeepSpeech#

yes: make sure you pass clean data to deepspeech, we can’t handle that problem at our level.

sorry, we don’t have time for that.

If I am feeding files that are fine and DeepSpeech is not able to handle them, shouldn’t that be fixed?
If they are not fine, shouldn’t you be able to tell why they are not fine, so that new users can avoid making those mistakes…

do you understand what it does?

My knowledge is limited to the documentation. This part of the documentation summarizes my understanding of bytes_output_mode:

In bytes output mode the model predicts UTF-8 bytes directly instead of letters from an alphabet file. This idea was proposed in the paper Bytes Are All You Need. This mode is enabled with the --bytes_output_mode flag at training and export time. At training time, the alphabet file is not used. Instead, the model is forced to have 256 labels, with labels 0-254 corresponding to UTF-8 byte values 1-255, and label 255 is used for the CTC blank symbol. If using an external scorer at decoding time, it MUST be built according to the instructions that follow.
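
As I understand it, each Japanese character therefore turns into several labels, since kana and kanji are 3 bytes each in UTF-8. A quick illustration:

# Illustration: in bytes output mode every UTF-8 byte is one CTC label,
# so a Japanese transcript has roughly 3 labels per character.
text = 'こんにちは'                 # 5 characters
print(len(text))                    # 5
print(len(text.encode('utf-8')))    # 15 labels in bytes output mode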

WAV files tend to be very picky, and DeepSpeech depends on how they are read by TensorFlow.

DeepSpeech is not here to correct your dataset.

Just because you think I know does not mean I know: this is your dataset, and I have no idea how you created, collected, or imported it.

You have the error, you have the code and you have the data that breaks it. You can’t expect me to do your homework and analyze your files.

That’s a copy/paste of the doc; it does not answer my question: do you understand what it does? What is it used for?

Since you have not cared enough to tell us what you are building, this can be part of the problem if you are using it wrongly.

The only code is from DeepSpeech, the error is from DeepSpeech, and the data is what I collected, so I have shared it with you…
If I had written DeepSpeech and came on the forum to ask why my code does not work, your statement would be valid.
You should be able to tell why the data won’t work when everything seems fine.

I am using it to create a Japanese STT model. Japanese is similar to Chinese: it is character based, with phonetic syllabaries that provide context to the characters.
DeepSpeech has already been used to make a Chinese model using bytes_output_mode, as stated in the docs:

Bytes output mode can be useful for languages with very large alphabets, such as Mandarin written with Simplified Chinese characters. It may also be useful for building multi-language models, or as a base for transfer learning. Currently these cases are untested and unsupported. Note that bytes output mode makes assumptions that hold for Mandarin written with Simplified Chinese characters and may not hold for other languages.

Perfect, so you understand.

Or not. NaN/infinite loss can have a lot of origins: files that are too short, mismatches between advertised and actual lengths, etc.

I’m on holiday, by the way.

Sorry and thank you for your patience.

For now I will just exclude the files causing infinite loss by using a script that processes the console messages.
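
Something along these lines should do it (a rough sketch; training.log is just whatever file I redirect the console output to, and the output CSV name is arbitrary):

# Sketch: collect the file paths reported after "infinite (or NaN) loss"
# in the captured console output and write a CSV without them.
import csv
import re

bad = set()
with open('training.log') as log:   # console output redirected to this file
    for line in log:
        m = re.search(r'infinite \(or NaN\) loss: (.*)$', line)
        if m:
            bad.update(p.strip() for p in m.group(1).split(','))

src = '/home/anon/Downloads/jaSTTDatasets/final-train.csv'
dst = '/home/anon/Downloads/jaSTTDatasets/final-train-clean.csv'

with open(src) as f:
    rows = [r for r in csv.DictReader(f) if r['wav_filename'] not in bad]

with open(dst, 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['wav_filename', 'wav_filesize', 'transcript'])
    writer.writeheader()
    writer.writerows(rows)

print('removed %d files' % len(bad))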

Thank you for the support.

Looks like you got 2 separate problems.

  1. OOM error: this usually means your batch size is too high. Get one full epoch going without errors.

  2. The inf loss: check that your chunks are of similar length, usually 4-8 seconds. Maybe merge chunks to have them all in the same range (one quick check for a common cause is sketched below).
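
One common cause of the inf loss worth ruling out: CTC cannot emit more labels than there are time steps, and in bytes output mode every UTF-8 byte of the transcript is one label (roughly 3 per Japanese character). With the default 20 ms feature step (if I remember the flag correctly), a clip needs more frames than transcript bytes. A rough sketch of that check, assuming the standard DeepSpeech CSV columns:

# Sketch: flag CSV rows where the transcript has more CTC labels (UTF-8 bytes
# in bytes output mode) than the clip has feature frames (~20 ms per frame).
import csv
import wave

FRAME_MS = 20  # assumed default --feature_win_step

def n_frames(path):
    with wave.open(path, 'rb') as w:
        return int(1000.0 * w.getnframes() / w.getframerate() / FRAME_MS)

with open('/home/anon/Downloads/jaSTTDatasets/final-train.csv') as f:
    for row in csv.DictReader(f):
        labels = len(row['transcript'].encode('utf-8'))
        frames = n_frames(row['wav_filename'])
        if labels >= frames:
            print('%s: %d labels vs %d frames' % (row['wav_filename'], labels, frames))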

I reduced the batch size to 8; the current epoch has been running for 45 minutes. I will update if it fails.

I am not familiar with ‘chunk’ in this context. My audio files currently range from 5-60 seconds. Do you want me to reprocess the data so that the files are all of similar audio length?
By merge, do you mean merging multiple audio files so they are around 60 seconds long and therefore all of similar length?

Ah, this will be the cause of some problems. Ideally chunks/audio segments/WAVs all have almost the same length, 4-8 or 10-15 seconds. I would recommend 5-10 seconds.

OK, I will update this post with my findings after I normalize for audio length.

I tried a batch size of 8 and it still fails. It also fails at a similar point as with 16 and 24: near the end.

Epoch 0 |   Training | Elapsed Time: 1:50:51 | Steps: 3471 | Loss: inf                             E The following files caused an infinite (or NaN) loss: /home/anon/Downloads/jaSTTDatasets/processedAudio/18311.wav,/home/anon/Downloads/jaSTTDatasets/processedAudio/14902.wav,/home/anon/Downloads/jaSTTDatasets/processedAudio/13702.wav
Epoch 0 |   Training | Elapsed Time: 1:52:00 | Steps: 3482 | Loss: inf                             Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[13384,2048] and type bool on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node tower_0/dropout_3/GreaterEqual}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[concat/concat/_119]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[13384,2048] and type bool on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[{{node tower_0/dropout_3/GreaterEqual}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 982, in run_script
    absl.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/DeepSpeech/training/deepspeech_training/train.py", line 954, in main
    train()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 607, in train
    train_loss, _ = run_set('train', epoch, train_init_op)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 572, in run_set
    feed_dict=feed_dict)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[13384,2048] and type bool on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node tower_0/dropout_3/GreaterEqual (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

	 [[concat/concat/_119]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

  (1) Resource exhausted: OOM when allocating tensor with shape[13384,2048] and type bool on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[node tower_0/dropout_3/GreaterEqual (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

Original stack trace for 'tower_0/dropout_3/GreaterEqual':
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 982, in run_script
    absl.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/DeepSpeech/training/deepspeech_training/train.py", line 954, in main
    train()
  File "/DeepSpeech/training/deepspeech_training/train.py", line 484, in train
    gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 317, in get_tower_results
    avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 244, in calculate_mean_edit_distance_and_loss
    logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 204, in create_model
    layers['layer_5'] = layer_5 = dense('layer_5', output, Config.n_hidden_5, dropout_rate=dropout[5], layer_norm=FLAGS.layer_norm)
  File "/DeepSpeech/training/deepspeech_training/train.py", line 93, in dense
    output = tf.nn.dropout(output, rate=dropout_rate)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 4229, in dropout
    return dropout_v2(x, rate, noise_shape=noise_shape, seed=seed, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/nn_ops.py", line 4313, in dropout_v2
    keep_mask = random_tensor >= rate
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_math_ops.py", line 4481, in greater_equal
    "GreaterEqual", x=x, y=y, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

I have excluded files that are greater than 2 MB. It shouldn’t be possible for 8 × 2 MB = 16 MB to cause a 4 GB GPU to go out of memory; correct me if there is some behaviour I am unaware of. Most files are around 250 KB.
The fact that it OOMs towards the end is suspicious of some kind of memory leak.
I will retry with a batch size of 4…

Try to run the files in reverse. There is some flag option for that. If the error is at the start, it is a file.

Batch size might not be the cause. But DeepSpeech, like most ML systems, pads every input in a batch to the same feature length, so the largest file determines the memory use. Try to exclude larger files.
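
To see which clips drive the peak memory, you could sort your training CSV by the wav_filesize column the importers write, roughly like this (path taken from your earlier posts):

# Sketch: print the 20 largest clips referenced in the training CSV.
import csv

with open('/home/anon/Downloads/jaSTTDatasets/final-train.csv') as f:
    rows = list(csv.DictReader(f))

rows.sort(key=lambda r: int(r['wav_filesize']), reverse=True)
for r in rows[:20]:
    print('%10s bytes  %s' % (r['wav_filesize'], r['wav_filename']))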

Using the --reverse_train flag immediately causes the program to go OOM. Apparently DeepSpeech sorts the files so that large files are at the bottom - https://github.com/mozilla/DeepSpeech/issues/2513.
Anything above a batch size of 4 crashes.
I think the --reverse_train flag would be a useful tip for topics such as What is the ideal batch size?
