Testing step slow using CPU when GPU was used for training

ambigus9 · October 4, 2021, 9:56pm

I am getting errors in the testing step, specifically Segmentation fault (core dumped). The testing step is also really slow, because DeepSpeech is using the CPU for inference instead of the GPU. Note that the GPU works well during training.

Here is my command:

CUDA_VISIBLE_DEVICES=0 python3 DeepSpeech.py --train_files ../dataset/cv-corpus-7.0-2021-07-21/es/clips/train_v2.csv \
                        --dev_files ../dataset/cv-corpus-7.0-2021-07-21/es/clips/dev_v2.csv \
                        --test_files ../dataset/cv-corpus-7.0-2021-07-21/es/clips/test_v2.csv \
                        --train_batch_size 32 \
                        --dev_batch_size 32 \
                        --test_batch_size 32 \
                        --use_allow_growth \
                        --epochs 1 \
                        --export_dir ../models/vtt_v1/ \
                        --checkpoint_dir ../checkpoints/vtt_v1/ \
                        --summary_dir /home/DeepSpeech

Here are my logs:

I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 1:19:12 | Steps: 6047 | Loss: 55.548004
…

I FINISHED optimization in 1:21:07.367672
I Loading best validating checkpoint from …/checkpoints/vtt_v1/best_dev-12093
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/weights
Testing model on …/dataset/cv-corpus-7.0-2021-07-21/es/clips/test_v2.csv
Test epoch | Steps: 4 | Elapsed Time: 0:01:36
Testing model on …/dataset/cv-corpus-7.0-2021-07-21/es/clips/test_v2.csv
Test epoch | Steps: 32 | Elapsed Time: 0:17:58 Fatal Python error: Segmentation fault

Thread 0x00007f1e56ffd700 (most recent call first):
File “/usr/lib/python3.6/multiprocessing/connection.py”, line 379 in _recv
File “/usr/lib/python3.6/multiprocessing/connection.py”, line 407 in _recv_bytes
File “/usr/lib/python3.6/multiprocessing/connection.py”, line 250 in recv
File “/usr/lib/python3.6/multiprocessing/pool.py”, line 463 in _handle_results
File “/usr/lib/python3.6/threading.py”, line 864 in run
File “/usr/lib/python3.6/threading.py”, line 916 in _bootstrap_inner
File “/usr/lib/python3.6/threading.py”, line 884 in _bootstrap

Thread 0x00007f1e577fe700 (most recent call first):
File “/home/DeepSpeech/training/deepspeech_training/util/helpers.py”, line 123 in _limit
File “/usr/lib/python3.6/multiprocessing/pool.py”, line 290 in _guarded_task_generation
File “/usr/lib/python3.6/multiprocessing/pool.py”, line 419 in _handle_tasks
File “/usr/lib/python3.6/threading.py”, line 864 in run
File “/usr/lib/python3.6/threading.py”, line 916 in _bootstrap_inner
File “/usr/lib/python3.6/threading.py”, line 884 in _bootstrap

Thread 0x00007f1e57fff700 (most recent call first):
File “/usr/lib/python3.6/multiprocessing/pool.py”, line 406 in _handle_workers
File “/usr/lib/python3.6/threading.py”, line 864 in run
File “/usr/lib/python3.6/threading.py”, line 916 in _bootstrap_inner
File “/usr/lib/python3.6/threading.py”, line 884 in _bootstrap

Thread 0x00007f1fcbfda700 (most recent call first):
File “/usr/lib/python3.6/threading.py”, line 295 in wait
File “/usr/lib/python3.6/queue.py”, line 164 in get
File “/root/tmp/deepspeech-train-venv/lib/python3.6/site-packages/tensorflow_core/python/summary/writer/event_file_writer.py”, line 159 in run
File “/usr/lib/python3.6/threading.py”, line 916 in _bootstrap_inner
File “/usr/lib/python3.6/threading.py”, line 884 in _bootstrap

Thread 0x00007f1fcb7d9700 (most recent call first):
File “/usr/lib/python3.6/threading.py”, line 295 in wait
File “/usr/lib/python3.6/queue.py”, line 164 in get
File “/root/tmp/deepspeech-train-venv/lib/python3.6/site-packages/tensorflow_core/python/summary/writer/event_file_writer.py”, line 159 in run
File “/usr/lib/python3.6/threading.py”, line 916 in _bootstrap_inner
File “/usr/lib/python3.6/threading.py”, line 884 in _bootstrap

Thread 0x00007f1fcafd8700 (most recent call first):
File “/usr/lib/python3.6/threading.py”, line 295 in wait
File “/usr/lib/python3.6/queue.py”, line 164 in get
File “/root/tmp/deepspeech-train-venv/lib/python3.6/site-packages/tensorflow_core/python/summary/writer/event_file_writer.py”, line 159 in run
File “/usr/lib/python3.6/threading.py”, line 916 in _bootstrap_inner
File “/usr/lib/python3.6/threading.py”, line 884 in _bootstrap

Thread 0x00007f205b0e0740 (most recent call first):
File “/root/tmp/deepspeech-train-venv/lib/python3.6/site-packages/ds_ctcdecoder/swigwrapper.py”, line 813 in ctc_beam_search_decoder_batch
File “/root/tmp/deepspeech-train-venv/lib/python3.6/site-packages/ds_ctcdecoder/__init__.py”, line 225 in ctc_beam_search_decoder_batch
File “/home/DeepSpeech/training/deepspeech_training/evaluate.py”, line 114 in run_test
File “/home/DeepSpeech/training/deepspeech_training/evaluate.py”, line 132 in evaluate
File “/home/DeepSpeech/training/deepspeech_training/train.py”, line 682 in test
File “/home/DeepSpeech/training/deepspeech_training/train.py”, line 958 in main
File “/root/tmp/deepspeech-train-venv/lib/python3.6/site-packages/absl/app.py”, line 258 in _run_main
File “/root/tmp/deepspeech-train-venv/lib/python3.6/site-packages/absl/app.py”, line 312 in run
File “/home/DeepSpeech/training/deepspeech_training/train.py”, line 982 in run_script
File “DeepSpeech.py”, line 12 in <module>
Segmentation fault (core dumped)

dkreutz · October 5, 2021, 6:14am

Some Numpy versions can cause such “segmentation fault” errors.
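To compare against known-good combinations, it helps to first record which versions are actually installed. The snippet below is a minimal sketch of that; the package list and the helper names (`installed_versions`, `_lookup`) are assumptions made for illustration, not part of DeepSpeech.

```python
# Sketch: report installed versions of packages that may matter for the crash.
# PACKAGES is an illustrative guess, not an official compatibility checklist.
PACKAGES = ["numpy", "ds-ctcdecoder", "deepspeech-training", "tensorflow-gpu"]


def _lookup(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
    except ImportError:
        # Older interpreters (such as the Python 3.6 in the traceback above)
        # can fall back to setuptools' pkg_resources.
        import pkg_resources
        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def installed_versions(packages=PACKAGES):
    """Map each package name to its installed version string (or None)."""
    return {name: _lookup(name) for name in packages}


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(name, ver if ver is not None else "not installed")
```

Run it inside the same virtual environment used for training, so the reported versions are the ones the training process actually loads.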

You may also want to check out coqui-ai STT, which was forked/founded by the previous DeepSpeech developer team.

ambigus9 · October 5, 2021, 2:12pm

Thanks. Is there a suggested Numpy version for the last stable version of DeepSpeech? (Meanwhile, I am starting to explore coqui-ai STT.)

ambigus9 · October 6, 2021, 11:06pm

@dkreutz Now I am getting full RAM use. Any idea what’s happening?

dkreutz · October 7, 2021, 5:22pm

I have never trained a DeepSpeech model myself, so I can only guess that the batch size is probably too large for your setup (how much RAM does your GPU have?).
You may also want to check the meaning of the use_allow_growth parameter and set it accordingly.
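The GPU memory question can be answered by querying `nvidia-smi`. The snippet below is a rough sketch that assumes an NVIDIA driver with `nvidia-smi` on the PATH; the helper names `parse_gpu_memory` and `gpu_memory` are made up for this example. It reports GPU memory only, so it will not explain system RAM filling up.

```python
import shutil
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'total, used' MiB pairs, one GPU per line, into dicts."""
    rows = []
    for line in csv_text.strip().splitlines():
        total, used = (int(field) for field in line.split(","))
        rows.append({"total_mib": total, "used_mib": used})
    return rows


def gpu_memory():
    """Return per-GPU memory figures, or None if they cannot be obtained."""
    if shutil.which("nvidia-smi") is None:
        return None  # no NVIDIA tooling on this machine
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total,memory.used",
             "--format=csv,noheader,nounits"],
            stdout=subprocess.PIPE, universal_newlines=True, check=True,
        ).stdout
        return parse_gpu_memory(out)
    except (subprocess.CalledProcessError, ValueError):
        return None  # driver not responding, or an unexpected output format


if __name__ == "__main__":
    print(gpu_memory())
```

Comparing the used figure while training runs at different `--train_batch_size` values gives a rough idea of whether the batch size fits on the card.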

Again, I recommend switching to coqui-STT, as Mozilla’s DeepSpeech repository no longer seems to be actively maintained… see https://github.com/mozilla/DeepSpeech/issues/3693

lissyx · October 8, 2021, 8:43am

The test phase runs purely on the CPU; this is known and documented.

The segmentation fault is unrelated to that and is likely an ABI mismatch of the ds_ctcdecoder module you are using. We have guards in place to try to avoid those, but maybe this one fell through the cracks.
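One generic way to narrow down a suspected native-module problem is to load the module in a separate interpreter, so that a crash kills only the child process. The following is a sketch of that idea, not a DeepSpeech tool: `import_in_subprocess` is a hypothetical helper, and it only exercises the import. The crash in this thread happened later, inside `ctc_beam_search_decoder_batch`, so a clean import does not rule out a mismatch.

```python
import signal
import subprocess
import sys


def import_in_subprocess(module_name):
    """Import module_name in a fresh interpreter and classify the outcome."""
    proc = subprocess.run(
        # -X faulthandler makes the child dump a traceback on fatal signals.
        [sys.executable, "-X", "faulthandler", "-c", "import " + module_name],
        stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True,
    )
    if proc.returncode == 0:
        return "ok"
    if proc.returncode < 0:
        # On POSIX, a negative return code is the number of the signal
        # that terminated the child (for example SIGSEGV).
        return "killed by " + signal.Signals(-proc.returncode).name
    return "import failed"


if __name__ == "__main__":
    print(import_in_subprocess("ds_ctcdecoder"))
```

A result of `killed by SIGSEGV` would point at the compiled extension itself, while `import failed` usually means the package is missing or raised an ordinary Python exception.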

ambigus9 · October 8, 2021, 1:39pm

@lissyx Thanks. Which version of ds_ctcdecoder should I install?

lissyx · October 8, 2021, 1:43pm

I don’t know; you don’t provide any context on what you are doing, and I don’t have time to search. I haven’t worked on DeepSpeech for a long time. Everything is documented.

ambigus9 · October 8, 2021, 1:49pm

@lissyx Thanks. I am just trying to finish a training run; here is the command I am using:

CUDA_VISIBLE_DEVICES=0 python3 DeepSpeech.py --train_files ../dataset/cv-corpus-7.0-2021-07-21/es/clips/train_v2.csv \
                        --dev_files ../dataset/cv-corpus-7.0-2021-07-21/es/clips/dev_v2.csv \
                        --test_files ../dataset/cv-corpus-7.0-2021-07-21/es/clips/test_v2.csv \
                        --train_batch_size 32 \
                        --dev_batch_size 32 \
                        --test_batch_size 32 \
                        --use_allow_growth \
                        --epochs 1 \
                        --export_dir ../models/vtt_v1/ \
                        --checkpoint_dir ../checkpoints/vtt_v1/ \
                        --summary_dir /home/DeepSpeech

And here is the pip list result:

deepspeech-tflite    0.9.3
deepspeech-training  0.9.3        /home/DeepSpeech/training
ds-ctcdecoder        0.9.3
tensorflow           1.15.4
tensorflow-estimator 1.15.1
tensorflow-gpu       1.15.4

Thanks for any advice.
