Doesn't use GPU while training, but during recognition it uses one

Hi guys. I'm just starting my way in ML and using DeepSpeech. Thanks, Mozilla, for your work - it's a really great project, but I have had some problems using it. Maybe someone can help me?
The main problem: when I use DeepSpeech.py to train a model on big datasets (like LibriSpeech, CV and others), it stops at the “I STARTING Optimization” message and does nothing else. Once I waited 1.5 days; the CPU was loaded to 100%, but nothing was printed except “I STARTING Optimization”. On a smaller dataset like ldc93s1 it finishes fine, but it doesn't use the GPU during training either.
There is also the problem of long “words” without spaces in the output when recognizing with the pre-trained model.
P.S. I installed and used everything following the step-by-step instructions from GitHub.
P.P.S. During recognition with ./deepspeech it correctly uses the GPU.

Have you installed the tensorflow-gpu package only in your virtual environment?

I have tensorflow==1.6.0 & tensorflow-gpu==1.6.0 simultaneously. I tried to use only tensorflow-gpu but got an error like ModuleNotFoundError: No module named 'tensorflow.python'. Should I change `tensorflow` to `tensorflow-gpu` manually in the code, or do something else?

You should uninstall everything and reinstall only tensorflow-gpu. And describe all your setup steps if you still have the ModuleNotFoundError.

I run a fresh Ubuntu 16.04 with Python 3.5 by default, change it to 3.6, install the NVIDIA drivers, CUDA 9, cuDNN 7, git-lfs, and sox (with mp3 support), then:
git clone https://github.com/mozilla/DeepSpeech
cd DeepSpeech/
wget -O - https://github.com/mozilla/DeepSpeech/releases/download/v0.1.1/deepspeech-0.1.1-models.tar.gz | tar xvfz -
sudo pip3 install -r requirements.txt
python3 util/taskcluster.py --target . --arch gpu
sudo pip3 uninstall tensorflow
sudo pip3 install 'tensorflow-gpu==1.6.0'
After that I try to run ./bin/run-ldc93s1.sh and get this message:
+ [ ! -f DeepSpeech.py ]
+ [ ! -f data/ldc93s1/ldc93s1.csv ]
+ echo Downloading and preprocessing LDC93S1 example data, saving in ./data/ldc93s1.
Downloading and preprocessing LDC93S1 example data, saving in ./data/ldc93s1.
+ python -u bin/import_ldc93s1.py ./data/ldc93s1
Traceback (most recent call last):
  File "bin/import_ldc93s1.py", line 12, in <module>
    from tensorflow.contrib.learn.python.learn.datasets import base
ImportError: No module named tensorflow.contrib.learn.python.learn.datasets

Please follow the documentation and set up a virtualenv properly. You should never ever have to sudo pip3 install; you are going to run into issues over and over.
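Something along these lines, roughly (the virtualenv name and location are just an example; the package versions are the ones from your steps above):
# Create and activate a Python 3 virtual environment
python3 -m venv $HOME/deepspeech-venv
source $HOME/deepspeech-venv/bin/activate
# Install everything inside the virtualenv, without sudo
pip3 install -r requirements.txt
pip3 uninstall tensorflow
pip3 install 'tensorflow-gpu==1.6.0'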

Oh! It really helped me. Thank you very much!
I have one question left: if I launch it on a multi-GPU machine, do I need to do anything extra to distribute the training, or will it do it automatically?

If you have several NVIDIA GPUs on one system, it should pick them up automagically. And you can play with the CUDA_VISIBLE_DEVICES environment variable to control which devices are visible to your process.
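For example (the GPU indices here are just illustrative):
# Expose only the first GPU to the training process
CUDA_VISIBLE_DEVICES=0 python3 DeepSpeech.py ...
# Expose the first two GPUs
CUDA_VISIBLE_DEVICES=0,1 python3 DeepSpeech.py ...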

Hi, I have one more question about using the GPU.
I started training with the following parameters:
DeepSpeech.py \
--initialize_from_frozen_model models/output_graph.pb \
--learning_rate 0.0005 \
--dropout_rate 0.2367 \
--epoch 1 \
--display_step 1 \
--validation_step 1 \
--fulltrace \
--checkpoint_dir tests/checkpoint_voip/ \
--checkpoint_secs 60 \
--export_dir tests/export_voip/ \
--summary_dir tests/tensorboard/ \
--summary_secs 120 \
--dev_batch_size 16 \
--dev_files data/voip_en/voip_en-wav-dev.csv \
--test_batch_size 16 \
--test_files data/voip_en/voip_en-wav-test.csv \
--train_batch_size 32 \
--train_files data/voip_en/voip_en-wav-train.csv

In this dataset, the audio recordings are 5-10 seconds long.
I have a 1-GPU machine.
Is it OK that the GPU is loaded at 98-100% for only 4 seconds, then sits at 0% for 26 seconds, and so on? (I use nvtop for GPU monitoring.)

It suggests you have something that runs on the CPU :-). It might be your data fetching that is inappropriately dimensioned: that depends on your training set and your GPU. Without more details on both, it’s unlikely we can help more.

GPU (Tesla K80): just one DeepSpeech.py process.
CPU (Intel Xeon E5-2686 v4 Broadwell): 22 subprocesses of DeepSpeech.py, tensorboard and some system processes.
About the dataset: the average record duration is 3.44 seconds (35,662 records in the training part), and it's stored on an SSD.
In the video you can see two moments when the GPU is loaded at 100%.
I have no more ideas what details to add :slight_smile:

Only one K80? That’s not a lot of power, as far as I remember. Small audio files? Maybe you need to put more into your batch.

Wait @makar.troyan, I missed that you set the display and validation steps to one. Then what you see is likely the WER computation taking CPU time.

On average, one audio file is 110 kB and 3.44 seconds long, with a transcription like "i would like information on a mediterranean restaurant" or "m looking for a pub and it must have an internet connection and a tv".
About the display and validation steps: do they take place during the epoch? I thought they were computed after a certain epoch.

Your command line above sets those steps to “1”, so it’s happening after every epoch …

Yes, of course. But the situation shown in the video happens during an epoch. It looks like the life cycle of a single batch, but I’m not sure. Is the WER computed after every batch, and if so, can it take that much time?

Please refrain from using videos or screenshots; they are hard to read and heavy. I can only comment on the command line you documented earlier. And my comment holds :slight_smile:

Yes, it happens after every batch and can delay things enough to underutilize the GPU. You can set display_step to a higher number so that you don’t calculate WER reports on every epoch.
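For example, in the command line above (the exact values are only an illustration; the other flags stay the same):
--display_step 10 \
--validation_step 5 \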

Also, as lissyx has already mentioned, make sure you’re using batch sizes that are as high as your GPU RAM can handle. In order to quickly find out if any given batch size is too high, you can look for the sort_values call in util/feeding.py and change the ascending parameter to False, so that longer samples are used first, then you’ll get OOMs faster and can search for the highest batch size that works for you. Make sure you flip it back again when training for real though :slight_smile:
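For example, from the DeepSpeech checkout (the exact line depends on your version of util/feeding.py):
# Find the sorting call mentioned above
grep -n sort_values util/feeding.py
# Temporarily change its "ascending" argument to False so the largest samples come first,
# run a short training to see whether a given batch size OOMs, then revert the edit before real training.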

I understood! Thank you very much, guys! I’ll try to play with the batch size and won’t get carried away with the validation and display steps ))