Long Training Time

I have an issue with a very slow training rate when training on 9800 audio files:

Epoch 0 | Training | Elapsed Time: 1:35:27 | Steps: 1196 | Loss: 77.780222

It seems like it is not running on the GPU. This is the output I get after running nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.43       Driver Version: 418.43       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P4000        Off  | 00000000:AF:00.0 Off |                  N/A |
| 61%   76C    P0    76W / 105W |   1488MiB /  8119MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     10523      C   python3                                       77MiB |
|    0     13468      C   colmap                                      1401MiB |
+-----------------------------------------------------------------------------+

I set everything up according to the tutorial at https://github.com/mozilla/DeepSpeech/blob/master/doc/TRAINING.rst#training-your-own-model
Any idea on how to solve this problem?

How do you run the training? Can you share the exact command line?

python3 DeepSpeech.py --train_files /home/ngwk/data/train/train.csv --dev_files /home/ngwk/data/dev/dev.csv --test_files /home/ngwk/data/test/test.csv --automatic_mixed_precision=True --epochs 10 --dropout_rate 0.025 --n_hidden 1000 --checkpoint_dir prebuiltcheckpoint

Initially I did not use mixed precision; the process was even slower.

So you are using the default batch size of 1. Please read python DeepSpeech.py --helpfull and set the batch size. We can't suggest a value since it depends on your data and your GPU, so you need to find the best compromise between training speed, learning, and not getting a GPU OOM.
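As a sketch, the same invocation with explicit batch-size flags might look like the following; the values 32/16/16 are placeholders to tune against your GPU memory, not recommendations:

```shell
# Hypothetical example: the original command with explicit batch sizes added.
# Tune the three batch-size values until just below GPU OOM.
python3 DeepSpeech.py \
  --train_files /home/ngwk/data/train/train.csv \
  --dev_files /home/ngwk/data/dev/dev.csv \
  --test_files /home/ngwk/data/test/test.csv \
  --automatic_mixed_precision=True \
  --epochs 10 \
  --dropout_rate 0.025 \
  --n_hidden 1000 \
  --checkpoint_dir prebuiltcheckpoint \
  --train_batch_size 32 \
  --dev_batch_size 16 \
  --test_batch_size 16
```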

That means I need to increase the batch size for both train_batch_size and dev_batch_size, correct?


Yes, this is what it means.

I currently tested with batch sizes of 20, 60, 80, and 120; the condition is still the same :sweat_smile:

Are you sure this is your Python process?

Can you re-check pip uninstall tensorflow-gpu && pip uninstall tensorflow && pip install --upgrade tensorflow-gpu==1.15.2?

Can you increase --log_level to verify your CUDA setup is not throwing errors?
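A quick way to check what TensorFlow itself sees, assuming the TF 1.15 virtualenv is active, is to list the local devices; if the GPU is usable, a device of type "GPU" should appear:

```shell
# Ask TensorFlow 1.x which compute devices it can use.
# A CPU-only setup will list only "/device:CPU:0".
python3 -c "from tensorflow.python.client import device_lib; \
print([d.name for d in device_lib.list_local_devices()])"
```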

[quote="lissyx, post:8, topic:57856"]
pip uninstall tensorflow-gpu && pip uninstall tensorflow && pip install --upgrade tensorflow-gpu==1.15.2
[/quote]

Just to confirm, TensorFlow should be installed in the virtual environment, right?

I checked the process with htop; the PID for the process is correct.

libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
2020-04-14 02:06:58.877930: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory
2020-04-14 02:06:58.877979: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory
2020-04-14 02:06:58.878030: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory
2020-04-14 02:06:58.878078: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory
2020-04-14 02:06:58.878126: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory

It seems like it requires CUDA 10.0, but the current version is 10.1?
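Exactly: those dlerror lines mean the dynamic loader cannot find the CUDA 10.0 runtime libraries that this TensorFlow build was compiled against. A rough stdlib-only sketch (not part of DeepSpeech; the function name is made up here) to check which of them are resolvable on your machine:

```python
import ctypes

# Libraries that the warnings above show TensorFlow 1.15 trying to dlopen
# (CUDA 10.0 builds).
REQUIRED = [
    "libcudart.so.10.0",
    "libcublas.so.10.0",
    "libcufft.so.10.0",
    "libcurand.so.10.0",
    "libcusolver.so.10.0",
    "libcusparse.so.10.0",
]

def check_cuda_libs(names):
    """Map each library name to True if the dynamic loader can open it."""
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)  # same resolution path as TensorFlow's dlopen
            results[name] = True
        except OSError:
            results[name] = False
    return results

if __name__ == "__main__":
    for lib, ok in check_cuda_libs(REQUIRED).items():
        print(f"{lib}: {'found' if ok else 'MISSING'}")
```

Any library reported MISSING must become findable (e.g. via LD_LIBRARY_PATH) before TensorFlow will use the GPU.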

You can install 10.0 and 10.1 side by side.

Yes, inside the virtual env.

Any tutorial on how I can do that?

Just extract CUDA 10.0 and the matching cuDNN version somewhere in your home directory and adjust LD_LIBRARY_PATH.
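A minimal sketch of that setup, assuming the CUDA 10.0 runfile installer and the matching cuDNN tarball have already been downloaded from NVIDIA (the file names below are illustrative, use the ones you actually downloaded):

```shell
# 1. Install the CUDA 10.0 toolkit into a per-user prefix, leaving the
#    system-wide 10.1 installation untouched (no root needed).
sh cuda_10.0.130_410.48_linux.run --silent --toolkit \
    --toolkitpath="$HOME/cuda-10.0"

# 2. Unpack the matching cuDNN (a v7.x build for CUDA 10.0) into the same
#    prefix; the tarball contains cuda/lib64 and cuda/include.
tar -xzf cudnn-10.0-linux-x64-v7.6.5.32.tgz -C "$HOME/cuda-10.0" \
    --strip-components=1

# 3. Make the loader prefer these libraries when launching training.
export LD_LIBRARY_PATH="$HOME/cuda-10.0/lib64:$LD_LIBRARY_PATH"
```

With that export in place, re-running the training command should make the libcudart.so.10.0 / libcublas.so.10.0 warnings disappear.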