DeepSpeech does not seem to use the GPU while training, but does use it with the native client

This may look like a duplicate of the issue "Doesn't use GPU while training, but during recognition it uses one"; however, I have properly sourced my venv.

These are the commands I used to rebuild the training environment after cloning:

git clone --branch v0.9.1 https://github.com/mozilla/DeepSpeech
cd DeepSpeech
python3 -m venv ./venv
source ./venv/bin/activate
pip3 install --upgrade pip==20.2.2 wheel==0.34.2 setuptools==49.6.0
pip3 install --upgrade -e .
pip3 uninstall tensorflow
pip3 install 'tensorflow-gpu==1.15.4'

At the pip3 install --upgrade -e . step, it does complain about the numpy version.
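A quick sanity check at this point, using standard TF 1.x APIs, to confirm that the venv's TensorFlow actually sees the GPU:

python -c "import tensorflow as tf; print(tf.test.is_gpu_available())"   # should print True and log the GTX 1650 being picked up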

Then I run ./bin/run-ldc93s1.sh, which works.
To test the GPU, I just duplicated the data lines in ./data/ldc93s1/ldc93s1.csv until I had 100+ inputs instead of just 1 (one way to do this is sketched below).
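For reference, a minimal sketch of that duplication, assuming the usual DeepSpeech CSV layout where the first line is the header (the output file name is just an example):

head -n 1 data/ldc93s1/ldc93s1.csv > /tmp/ldc93s1-dup.csv
for i in $(seq 100); do tail -n 1 data/ldc93s1/ldc93s1.csv >> /tmp/ldc93s1-dup.csv; done   # append 100 copies of the single sample row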

Then I modified the ./bin/run-ldc93s1.sh file:

# Force only one visible device because we have a single-sample dataset
# and when trying to run on multiple devices (like GPUs), this will break
#export CUDA_VISIBLE_DEVICES=0

python -u DeepSpeech.py --noshow_progressbar \
  --train_files data/ldc93s1/ldc93s1.csv \
  --train_batch_size 100 \
  --n_hidden 100 \
  --epochs 20 \
  --bytes_output_mode \
  --checkpoint_dir "$checkpoint_dir" \
  "$@"
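While this runs, GPU activity can be watched from a second terminal with the standard NVIDIA tool:

watch -n 1 nvidia-smi   # refreshes GPU utilization and per-process memory usage every second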

It runs successfully, but it does not use the GPU at all.
Any thoughts?

You don’t share your system setup, you don’t document anything, and you don’t provide any runtime logs. Are we supposed to do divination?

Logs for ./bin/run-ldc93s1.sh:

  + [ ! -f DeepSpeech.py ]
  + [ ! -f data/ldc93s1/ldc93s1.csv ]
  + [ -d ]
  + python -c from xdg import BaseDirectory as xdg; print(xdg.save_data_path("deepspeech/ldc93s1"))
  + checkpoint_dir=/home/anon/.local/share/deepspeech/ldc93s1
  + python -u DeepSpeech.py --noshow_progressbar --train_files data/ldc93s1/ldc93s1.csv --train_batch_size 100 --n_hidden 100 --epochs 5 --bytes_output_mode --checkpoint_dir /home/anon/.local/share/deepspeech/ldc93s1
    I Could not find best validating checkpoint.
    I Loading most recent checkpoint from /home/anon/.local/share/deepspeech/ldc93s1/train-120
    I Loading variable from checkpoint: beta1_power
    I Loading variable from checkpoint: beta2_power
    I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
    I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam
    I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam_1
    I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
    I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel/Adam
    I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel/Adam_1
    I Loading variable from checkpoint: global_step
    I Loading variable from checkpoint: layer_1/bias
    I Loading variable from checkpoint: layer_1/bias/Adam
    I Loading variable from checkpoint: layer_1/bias/Adam_1
    I Loading variable from checkpoint: layer_1/weights
    I Loading variable from checkpoint: layer_1/weights/Adam
    I Loading variable from checkpoint: layer_1/weights/Adam_1
    I Loading variable from checkpoint: layer_2/bias
    I Loading variable from checkpoint: layer_2/bias/Adam
    I Loading variable from checkpoint: layer_2/bias/Adam_1
    I Loading variable from checkpoint: layer_2/weights
    I Loading variable from checkpoint: layer_2/weights/Adam
    I Loading variable from checkpoint: layer_2/weights/Adam_1
    I Loading variable from checkpoint: layer_3/bias
    I Loading variable from checkpoint: layer_3/bias/Adam
    I Loading variable from checkpoint: layer_3/bias/Adam_1
    I Loading variable from checkpoint: layer_3/weights
    I Loading variable from checkpoint: layer_3/weights/Adam
    I Loading variable from checkpoint: layer_3/weights/Adam_1
    I Loading variable from checkpoint: layer_5/bias
    I Loading variable from checkpoint: layer_5/bias/Adam
    I Loading variable from checkpoint: layer_5/bias/Adam_1
    I Loading variable from checkpoint: layer_5/weights
    I Loading variable from checkpoint: layer_5/weights/Adam
    I Loading variable from checkpoint: layer_5/weights/Adam_1
    I Loading variable from checkpoint: layer_6/bias
    I Loading variable from checkpoint: layer_6/bias/Adam
    I Loading variable from checkpoint: layer_6/bias/Adam_1
    I Loading variable from checkpoint: layer_6/weights
    I Loading variable from checkpoint: layer_6/weights/Adam
    I Loading variable from checkpoint: layer_6/weights/Adam_1
    I Loading variable from checkpoint: learning_rate
    I STARTING Optimization
    I Training epoch 0…
    I Finished training epoch 0 - loss: 357.920593

I Training epoch 1…
I Finished training epoch 1 - loss: 357.626068

I Training epoch 2…
I Finished training epoch 2 - loss: 357.542145

I Training epoch 3…
I Finished training epoch 3 - loss: 357.455444

I Training epoch 4…
I Finished training epoch 4 - loss: 357.503204

I FINISHED optimization in 0:00:04.511124

Running this on MX Linux, a Debian-based distro.
Graphics card: NVIDIA GTX 1650.

Please enable more verbose logging so we get CUDA feedback from TensorFlow.

And I again have to ask (each of these can be answered from the command line, as sketched after this list):

  • Does your CUDA version match the TensorFlow requirements?
  • Is CUDA working?
  • Is TensorFlow with CUDA working?
  • nvidia-smi output?
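For reference, these can all be captured as plain-text output rather than screenshots, using standard NVIDIA/TensorFlow tooling (exact paths may vary per distro):

nvidia-smi        # driver version, visible GPUs, per-process usage
nvcc --version    # installed CUDA toolkit version
python -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"   # does TF list a GPU device?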

Snapshot of nvtop while training:

Cool, you continue to completely ignore our guidelines for requesting support.

Anyway, this shows DeepSpeech using the GPU, so I really don’t understand your problem.

Explain it clearly, follow guidelines.

$ python -u DeepSpeech.py --noshow_progressbar --train_files data/ldc93s1/ldc93s1.csv --train_batch_size 100 --n_hidden 100 --epochs 3 --bytes_output_mode --checkpoint_dir /home/anon/.local/share/deepspeech/ldc93s1 --verbosity 1
I Could not find best validating checkpoint.
I Loading most recent checkpoint from /home/anon/.local/share/deepspeech/ldc93s1/train-195
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias/Adam_1
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel/Adam_1
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/bias/Adam
I Loading variable from checkpoint: layer_6/bias/Adam_1
I Loading variable from checkpoint: layer_6/weights
I Loading variable from checkpoint: layer_6/weights/Adam
I Loading variable from checkpoint: layer_6/weights/Adam_1
I Loading variable from checkpoint: learning_rate
I STARTING Optimization
I Training epoch 0…
I Finished training epoch 0 - loss: 351.114716

I Training epoch 1…
I Finished training epoch 1 - loss: 350.875793

I Training epoch 2…
I Finished training epoch 2 - loss: 350.717255

I FINISHED optimization in 0:00:03.810965

The command above was run with --verbosity 1.

Following is a screenshot of all the CUDA libraries on my laptop.
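The same information can also be captured as text, assuming a standard dynamic-linker setup:

ldconfig -p | grep -Ei 'cuda|cudnn'   # every CUDA/cuDNN shared library the linker knows about, with version suffixes like libcudart.so.10.0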

I use the native client and it uses the GPU fine; the logs from it may help you.

GPU usage while using the client:

As you can see, while training the GPU memory doesn't even go over 1%, whereas while using the client it goes up to 82%.

My batch size is 100 and the clips are about 5 seconds each.

Please let me know if you require any other information. Thanks in advance!

Because you do not have a lot of data.

PLEASE STOP SHARING SCREENSHOTS. Am I more clear?

OK, I will try with more test data and get back to you if that does not resolve the issue. Thank you for everything.

As suggested, I increased my test data from 120 audio files to 2000, but the GPU usage stays the same:

0 Compute 53MB 1% 560% 1725MB python -u DeepSpeech.py

GPU at 1%, CPU at 560%.

GPU batch size, maybe?

Anyway, this topic has long been covered; please read the existing posts carefully, as I don’t have time to debug your setup.

The batch size is 100, as before.

Alright; the only reason I made this post was that I couldn't find much, and the only other similar post was the one I linked. Anyway, I will try using the Docker file. Thank you!

If you had shared full text logs instead of screenshots, I could have seen that you are using CUDA 10.1, which IS NOT SUPPORTED BY TensorFlow 1.15.4. Please use the proper setup as documented.
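TensorFlow 1.15 is built against CUDA 10.0, so a quick text-only check of the installed toolkits (assuming the default /usr/local install locations) would have shown the mismatch:

ls -d /usr/local/cuda*            # all installed toolkit directories, e.g. cuda-10.0, cuda-10.1
cat /usr/local/cuda/version.txt   # the version the "cuda" symlink actually points at

If only CUDA 10.1 libraries are present, TF 1.15.4 fails to load libcudart.so.10.0 and training falls back to the CPU.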

@Shravan_Shetty it's really a mess: you don't share the full training logs with CUDA info for training, so I can only infer from your inference screenshots that you are using CUDA 10.1.