Training with Mozilla Common Voice 50GB stops after some hours without any error

Hi Guys

I am facing an issue while training on top of the existing model (using the 0.7.4 checkpoint) with the 50 GB Common Voice dataset.

python3 DeepSpeech.py --train_files …/cv-corpus-5.1-2020-06-22/en/clips/train.csv --dev_files …/cv-corpus-5.1-2020-06-22/en/clips/dev.csv --test_files …/cv-corpus-5.1-2020-06-22/en/clips/test.csv --train_batch_size 128 --test_batch_size 128 --dev_batch_size 128 --epochs 125 --n_hidden 2048 --learning_rate 0.0001 --dropout_rate 0.40 --export_dir …/exports/ --checkpoint_dir …/load_checkpoint/deepspeech-0.7.4-checkpoint/ --load_cudnn

The process starts fine, but it ends without any error after a certain number of steps. I don’t have any clue what is going wrong.

I have run the same process twice with slight modifications to learning_rate and epochs, with the rest of the parameters unchanged.

The first time it stopped after 183 steps, and the second time after 481 steps.

Please help if anyone has faced a similar issue.

Thanks in advance.

Please post the log output. Without anything to go on, we can’t tell you much.

There is no error in the logs, @othiele.
The last thing I could see is:

image

After that, nothing is logged in the file. I ran this command as a background process.

Please read how to post in the forums again. No images.

https://discourse.mozilla.org/t/what-and-how-to-report-if-you-need-support/62071/2

Run it on the console, in the foreground. I think I remember something like this: either your OS stops it after some time, or search for “SIGINT” here. You would see the message if it runs in the foreground. Maybe @lissyx remembers that?

Also check messages or syslog in /var/log/, and possibly dmesg.
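If it helps, here is a minimal sketch for scanning a kernel log for signs that the OOM killer ended the process. The function names and the pattern are my own assumptions, not part of DeepSpeech, and the exact kernel wording can vary between kernel versions, so treat a miss as inconclusive.

```python
import re

# Assumed pattern: phrases the Linux kernel commonly logs when the OOM killer
# acts ("Out of memory: Killed process ...", "... invoked oom-killer", "oom-kill:").
OOM_PATTERN = re.compile(r"out of memory|oom-kill", re.IGNORECASE)


def find_oom_lines(lines):
    """Return the lines that look like OOM-killer messages, in order."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


def scan_file(path):
    """Scan one log file (hypothetical helper); reading it may require root."""
    with open(path, errors="replace") as handle:
        return find_oom_lines(handle)
```

For example, `scan_file("/var/log/kern.log")` returns the matching lines, if the log exists at that path on your distribution and you have permission to read it.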


Yes, I checked kern.log, @baconator. It says OOM. Should I decrease the batch_size and run the training again?

Looks like you’re using a CPU? How much RAM does your system have? (If GPU, how much memory does it have?) How far to scale the batch sizes depends on your system, though; I’d try cutting it down and seeing if it works. Also, if you are training on CPU, you’re almost certainly going to want to switch to GPU for that volume of training.
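One way to approach "cutting it down" is to halve the batch size until a run survives. The sketch below only generates the candidate sizes and the matching flags from the command in the first post; the helper names are hypothetical, and using the same size for train, dev and test is an assumption carried over from that command, not a requirement.

```python
def halving_schedule(start, minimum=1):
    """Return batch sizes from `start` downward, halving each time."""
    floor = max(minimum, 1)  # guard against a non-positive minimum
    sizes = []
    size = start
    while size >= floor:
        sizes.append(size)
        size //= 2
    return sizes


def batch_size_flags(size):
    """Build the three batch-size flags used in the original command."""
    return [
        "--train_batch_size", str(size),
        "--dev_batch_size", str(size),
        "--test_batch_size", str(size),
    ]


if __name__ == "__main__":
    # Print the flags to try, largest batch size first.
    for candidate in halving_schedule(128, minimum=16):
        print(" ".join(batch_size_flags(candidate)))
```

Which size finally fits depends on available RAM or GPU memory, so this is a starting point for trial runs rather than a way to predict the right value.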
