Long Training Time

I have an issue with a very slow training rate when training on 9800 audio files:

Epoch 0 | Training | Elapsed Time: 1:35:27 | Steps: 1196 | Loss: 77.780222

It seems like it is not running on the GPU. This is the output I get after running nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.43       Driver Version: 418.43       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro P4000        Off  | 00000000:AF:00.0 Off |                  N/A |
| 61%   76C    P0    76W / 105W |   1488MiB /  8119MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     10523      C   python3                                       77MiB |
|    0     13468      C   colmap                                      1401MiB |
+-----------------------------------------------------------------------------+

I set everything up according to the tutorial at https://github.com/mozilla/DeepSpeech/blob/master/doc/TRAINING.rst#training-your-own-model
Any idea on how to solve this problem?

How do you run the training? Can you share the exact command line?

python3 DeepSpeech.py --train_files /home/ngwk/data/train/train.csv --dev_files /home/ngwk/data/dev/dev.csv --test_files /home/ngwk/data/test/test.csv --automatic_mixed_precision=True --epochs 10 --dropout_rate 0.025 --n_hidden 1000 --checkpoint_dir prebuiltcheckpoint

Initially I did not use mixed precision; the process was even slower.

So you are using the default batch size of 1. Please read python DeepSpeech.py --helpfull and set the batch size. We can't suggest a value since it depends on your data and your GPU, so you need to find the best compromise between training speed, learning, and not getting a GPU OOM.
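As a sketch, the same invocation with explicit batch-size flags might look like the following; the values 32/16/16 are placeholders to tune against your GPU memory, not recommendations:

```shell
# Hypothetical example: the original command with explicit batch sizes added.
# Tune the three batch-size values until just below GPU OOM.
python3 DeepSpeech.py \
  --train_files /home/ngwk/data/train/train.csv \
  --dev_files /home/ngwk/data/dev/dev.csv \
  --test_files /home/ngwk/data/test/test.csv \
  --automatic_mixed_precision=True \
  --epochs 10 \
  --dropout_rate 0.025 \
  --n_hidden 1000 \
  --checkpoint_dir prebuiltcheckpoint \
  --train_batch_size 32 \
  --dev_batch_size 16 \
  --test_batch_size 16
```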

That means I need to increase the batch size for both train_batch_size and dev_batch_size, correct?


Yes, this is what it means.

I currently tested with batch sizes of 20, 60, 80, and 120; the condition is still the same :sweat_smile:

Are you sure this is your Python process?

Can you re-check pip uninstall tensorflow-gpu && pip uninstall tensorflow && pip install --upgrade tensorflow-gpu==1.15.2?

Can you increase --log_level to verify your CUDA setup is not throwing errors?
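A quick way to check what TensorFlow itself sees, assuming the TF 1.15 virtualenv is active, is to list the local devices; if the GPU is usable, a device of type "GPU" should appear:

```shell
# Ask TensorFlow 1.x which compute devices it can use.
# A CPU-only setup will list only "/device:CPU:0".
python3 -c "from tensorflow.python.client import device_lib; \
print([d.name for d in device_lib.list_local_devices()])"
```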

[quote="lissyx, post:8, topic:57856"]
pip uninstall tensorflow-gpu && pip uninstall tensorflow && pip install --upgrade tensorflow-gpu==1.15.2
[/quote]

Just to confirm, TensorFlow should be installed in the virtual environment, right?

I checked the process with htop; the PID for the process is correct.

libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
2020-04-14 02:06:58.877930: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory
2020-04-14 02:06:58.877979: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory
2020-04-14 02:06:58.878030: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory
2020-04-14 02:06:58.878078: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory
2020-04-14 02:06:58.878126: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory

It seems like it requires CUDA 10.0, but the current version is 10.1?
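Exactly: those dlerror lines mean the dynamic loader cannot find the CUDA 10.0 runtime libraries that this TensorFlow build was compiled against. A rough stdlib-only sketch (not part of DeepSpeech; the function name is made up here) to check which of them are resolvable on your machine:

```python
import ctypes

# Libraries that the warnings above show TensorFlow 1.15 trying to dlopen
# (CUDA 10.0 builds).
REQUIRED = [
    "libcudart.so.10.0",
    "libcublas.so.10.0",
    "libcufft.so.10.0",
    "libcurand.so.10.0",
    "libcusolver.so.10.0",
    "libcusparse.so.10.0",
]

def check_cuda_libs(names):
    """Map each library name to True if the dynamic loader can open it."""
    results = {}
    for name in names:
        try:
            ctypes.CDLL(name)  # same resolution path as TensorFlow's dlopen
            results[name] = True
        except OSError:
            results[name] = False
    return results

if __name__ == "__main__":
    for lib, ok in check_cuda_libs(REQUIRED).items():
        print(f"{lib}: {'found' if ok else 'MISSING'}")
```

Any library reported MISSING must become findable (e.g. via LD_LIBRARY_PATH) before TensorFlow will use the GPU.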

You can install 10.0 and 10.1 side by side.

Yes, inside the virtual env.

Any tutorial on how I can do that?

Just extract CUDA 10.0 and the matching cuDNN version somewhere in your home directory and adjust LD_LIBRARY_PATH.
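A minimal sketch of that setup, assuming the CUDA 10.0 runfile installer and the matching cuDNN tarball have already been downloaded from NVIDIA (the file names below are illustrative, use the ones you actually downloaded):

```shell
# 1. Install the CUDA 10.0 toolkit into a per-user prefix, leaving the
#    system-wide 10.1 installation untouched (no root needed).
sh cuda_10.0.130_410.48_linux.run --silent --toolkit \
    --toolkitpath="$HOME/cuda-10.0"

# 2. Unpack the matching cuDNN (a v7.x build for CUDA 10.0) into the same
#    prefix; the tarball contains cuda/lib64 and cuda/include.
tar -xzf cudnn-10.0-linux-x64-v7.6.5.32.tgz -C "$HOME/cuda-10.0" \
    --strip-components=1

# 3. Make the loader prefer these libraries when launching training.
export LD_LIBRARY_PATH="$HOME/cuda-10.0/lib64:$LD_LIBRARY_PATH"
```

With that export in place, re-running the training command should make the libcudart.so.10.0 / libcublas.so.10.0 warnings disappear.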