Training randomly becomes very slow

While training, the training or validation phase sometimes takes far too long, and it happens at random points.
For example, in epoch 14 below, validation takes about 5 minutes, which is normal. But sometimes it takes a few hours with the same data set, and this happens randomly. While it is happening, the GPUs are idle and the CPU stats look weird (I attach the "top" and "nvidia-smi" output).

Is there any way I can solve this problem?

Epoch 14 | Training | Elapsed Time: 1:16:39 | Steps: 3864 | Loss: 54.739649
Epoch 14 | Validation | Elapsed Time: 0:04:55 | Steps: 322 | Loss: 53.396248 | Dataset: /home/ubuntu/src/deepspeech/latest/DeepSpeech/data/csv/dev_1.csv
I Saved new best validating model with loss 53.396248 to: /home/ubuntu/src/deepspeech/latest/DeepSpeech/data/checkpoint/best_dev-70036
Epoch 15 | Training | Elapsed Time: 7:51:04 | Steps: 3864 | Loss: 53.578589
Epoch 15 | Validation | Elapsed Time: 0:51:09 | Steps: 322 | Loss: 52.377501 | Dataset: /home/ubuntu/src/deepspeech/latest/DeepSpeech/data/csv/dev_1.csv
I Saved new best validating model with loss 52.377501 to: /home/ubuntu/src/deepspeech/latest/DeepSpeech/data/checkpoint/best_dev-73900
Epoch 16 | Training | Elapsed Time: 4:45:37 | Steps: 3864 | Loss: 52.526960
Epoch 16 | Validation | Elapsed Time: 0:04:56 | Steps: 322 | Loss: 51.434787 | Dataset: /home/ubuntu/src/deepspeech/latest/DeepSpeech/data/csv/dev_1.csv
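
For reference, a small script like the following could pull the elapsed times out of log lines in the format above and flag the unusually slow epochs; the 2-hour threshold and the log file name are just placeholders.

import re
from datetime import timedelta

# Matches lines like:
# Epoch 15 | Training | Elapsed Time: 7:51:04 | Steps: 3864 | Loss: 53.578589
LINE_RE = re.compile(
    r'Epoch (\d+) \| (Training|Validation) \| Elapsed Time: (\d+):(\d+):(\d+)')

def elapsed_times(log_path):
    """Yield (epoch, phase, elapsed) for each matching line of the log."""
    with open(log_path) as f:
        for line in f:
            m = LINE_RE.search(line)
            if m:
                hours, minutes, seconds = (int(m.group(i)) for i in (3, 4, 5))
                yield (int(m.group(1)), m.group(2),
                       timedelta(hours=hours, minutes=minutes, seconds=seconds))

for epoch, phase, t in elapsed_times('train.log'):
    flag = '  <-- unusually slow' if t > timedelta(hours=2) else ''
    print(f'Epoch {epoch:3d}  {phase:<10}  {t}{flag}')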

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   46C    P8    13W / 250W |  10915MiB / 11177MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
|  0%   43C    P8    13W / 250W |  10915MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     14857      C   python                                     10889MiB |
|    1     14857      C   python                                     10889MiB |
+-----------------------------------------------------------------------------+

%Cpu0 : 0.0 us, 78.3 sy, 0.0 ni, 21.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 99.0 sy, 0.0 ni, 1.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 0.0 us,100.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us,100.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu4 : 8.0 us, 75.9 sy, 0.0 ni, 16.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 0.0 us, 99.7 sy, 0.0 ni, 0.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6 : 8.3 us, 87.4 sy, 0.0 ni, 4.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 : 0.0 us, 30.7 sy, 0.0 ni, 69.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu8 : 8.4 us, 20.1 sy, 0.0 ni, 71.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu9 : 0.3 us, 8.0 sy, 0.0 ni, 91.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu10 : 0.0 us,100.0 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu11 : 0.0 us, 8.0 sy, 0.0 ni, 92.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 32874304 total, 1048340 free, 18044372 used, 13781592 buff/cache
KiB Swap: 33406972 total, 31324732 free, 2082240 used. 5838024 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14857 ubuntu 20 0 76.650g 0.024t 8.631g S 830.6 79.7 4605:44 DeepSpeech.py

CPU INFO: Intel® Core™ i7-8700 CPU @ 3.20GHz
MEMORY: 32G
GPU: GeForce 1080Ti x2
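
To catch the idle periods as they happen instead of spot-checking nvidia-smi by hand, a small polling loop could log GPU utilization over time. This is only a sketch and assumes a pynvml binding (e.g. the nvidia-ml-py3 package) is installed:

import time
import pynvml

# Poll utilization every 5 s so idle stretches can be matched against slow epochs.
pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]
try:
    while True:
        utils = [pynvml.nvmlDeviceGetUtilizationRates(h).gpu for h in handles]
        note = '  (all GPUs idle)' if all(u == 0 for u in utils) else ''
        print(f"{time.strftime('%H:%M:%S')}  GPU util %: {utils}{note}")
        time.sleep(5)
finally:
    pynvml.nvmlShutdown()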

Is the machine loaded due to other processes?

No, it's a clean Ubuntu install. I also tested another PC with the same spec and got the same result. Any suggestions?

Not really.

I'd guess there are some semi-periodic background tasks slowing progress, but I've no idea what they could be. That this also happened on a second PC with a clean Ubuntu install seems very strange, and I've no explanation for that.

We’ve never seen such behavior here. We’ve only seen epoch times steadily decrease. The only increases we’ve seen were from contention for hardware resources by competing processes.
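
If contention is suspected, something like the psutil sketch below can list the busiest processes during a slow stretch (psutil is an assumed dependency; plain top gives the same picture interactively):

import time
import psutil

# Prime the per-process CPU counters, wait a second, then read the deltas
# and print the ten busiest processes.
procs = list(psutil.process_iter(['pid', 'name']))
for p in procs:
    try:
        p.cpu_percent(None)
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass
time.sleep(1.0)

busiest = []
for p in procs:
    try:
        busiest.append((p.cpu_percent(None), p.info['pid'], p.info['name']))
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        pass

for cpu, pid, name in sorted(busiest, reverse=True)[:10]:
    print(f'{cpu:6.1f}%  {pid:>7}  {name}')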

Could it be a problem with the data set? If there is a defective file, could that cause this problem?

I have already checked all the files and excluded wave files longer than 27 seconds using sox. Are there any constraints on audio length?

If there are epochs that are fast, then it's not a data set problem; if it were a data set problem, there would be no fast epochs.

As to audio length, there are no constraints, but we train on audio less than 10 seconds long just to remove outliers and keep GPU utilization high.
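
For reference, here is a minimal sketch of how such a duration cutoff could be applied to a DeepSpeech-style CSV using only the standard library; the wav_filename column name and the 10-second limit are assumptions based on the reply above, and sox/soxi would work just as well:

import csv
import wave

MAX_SECONDS = 10.0  # assumed cutoff; adjust as needed

def wav_duration(path):
    """Duration of a PCM WAV file in seconds."""
    with wave.open(path, 'rb') as w:
        return w.getnframes() / float(w.getframerate())

def filter_csv(src, dst, max_seconds=MAX_SECONDS):
    """Copy rows from src to dst, dropping clips longer than max_seconds."""
    with open(src, newline='') as fin, open(dst, 'w', newline='') as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            if wav_duration(row['wav_filename']) <= max_seconds:
                writer.writerow(row)

# Example (hypothetical paths):
# filter_csv('data/csv/train_1.csv', 'data/csv/train_1_short.csv')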

I believe it has been resolved by upgrading Ubuntu 16.04 to 18.04.
After 15 epochs so far, the problem hasn't appeared again.
Thanks!