Training with Mozilla Common Voice 50GB stops after some hours without any error

Harmandeep_Singh · July 19, 2020, 12:30pm

Hi Guys

I am facing an issue while training on top of the existing model (using the 0.7.4 checkpoint) with the 50 GB Common Voice dataset.

python3 DeepSpeech.py --train_files …/cv-corpus-5.1-2020-06-22/en/clips/train.csv --dev_files …/cv-corpus-5.1-2020-06-22/en/clips/dev.csv --test_files …/cv-corpus-5.1-2020-06-22/en/clips/test.csv --train_batch_size 128 --test_batch_size 128 --dev_batch_size 128 --epochs 125 --n_hidden 2048 --learning_rate 0.0001 --dropout_rate 0.40 --export_dir …/exports/ --checkpoint_dir …/load_checkpoint/deepspeech-0.7.4-checkpoint/ --load_cudnn

The process starts fine, but it ends without any error after a certain number of steps. I don’t have any clue what is going wrong.

I have run the same process twice with slight modifications to learning_rate and epochs, with the rest of the parameters unchanged.

The first time it stopped after 183 steps and the second time after 481 steps.

Please help if someone has faced a similar issue.

Thanks in advance.

othiele · July 19, 2020, 1:21pm

Please post the log output, without anything to go on we can’t tell you much.

Harmandeep_Singh · July 19, 2020, 1:32pm

There is no error in the logs, @othiele.
The last thing I could see is:

[screenshot of the log output, not reproduced here]

After that nothing is logged in the file. I ran this command as a background process.

othiele · July 19, 2020, 3:45pm

Please read how to post in the forums again. No images.

https://discourse.mozilla.org/t/what-and-how-to-report-if-you-need-support/62071/2

Run it on the console and in the foreground. I think I remember something like this. Either your OS stops it after some time, or search for “SIGINT” here. You would see the message if it runs in the foreground. Maybe @lissyx remembers that?

baconator · July 21, 2020, 3:21am

Also check messages or syslog in /var/log/, and possibly dmesg.
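
If it helps, here is a rough Python sketch for scanning those logs for out-of-memory kills. The patterns are an assumption based on what Linux kernels usually print when the OOM killer fires (wording varies between kernel versions), and `find_oom_lines` and `scan_logs` are just illustrative names, not part of DeepSpeech.

```python
import re

# Assumed patterns: typical OOM-killer wording in Linux kernel logs.
# The exact text differs between kernel versions, so adjust as needed.
OOM_PATTERN = re.compile(
    r"out of memory|oom-killer|oom-kill|killed process", re.IGNORECASE
)


def find_oom_lines(log_text):
    """Return the lines of log_text that look like OOM-killer messages."""
    return [line for line in log_text.splitlines() if OOM_PATTERN.search(line)]


def scan_logs(paths=("/var/log/kern.log", "/var/log/syslog", "/var/log/messages")):
    """Print suspicious lines from whichever of the given log files are readable."""
    for path in paths:
        try:
            with open(path, errors="replace") as handle:
                text = handle.read()
        except OSError:
            # File is missing or needs root; skip it.
            continue
        for line in find_oom_lines(text):
            print(f"{path}: {line}")
```

Calling `scan_logs()` (possibly as root) prints any matching lines. If the training process shows up there as killed, the silent stop was most likely the kernel reclaiming memory.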

Harmandeep_Singh · July 21, 2020, 3:40am

Yes, I checked kern.log, @baconator, and it says OOM. Should I decrease the batch_size and rerun the training?

baconator · July 21, 2020, 5:03am

Looks like you’re using a CPU? How much RAM does your system have? (If GPU, how much memory does it have?) How far to scale the batch sizes depends on your system, though; I’d try cutting them down and seeing if it works. Also, if you are training on CPU, you’re almost certainly going to want to switch to a GPU for that volume of training.
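
To answer the RAM question quickly on Linux, something like the sketch below should do. It assumes the usual `/proc/meminfo` layout (`Name:   value kB`); `parse_meminfo` and `total_ram_gib` are made-up helper names for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: integer value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def total_ram_gib(path="/proc/meminfo"):
    """Return total system RAM in GiB, read from the MemTotal field (Linux only)."""
    with open(path) as handle:
        info = parse_meminfo(handle.read())
    return info["MemTotal"] / (1024 ** 2)  # kB -> GiB
```

Running `print(total_ram_gib())` on the training machine gives a number to compare against the batch size you are trying.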
