Hi Guys
I am facing an issue while training on top of the existing model (using the 0.7.4 checkpoint) with the 50 GB Common Voice dataset.
python3 DeepSpeech.py --train_files …/cv-corpus-5.1-2020-06-22/en/clips/train.csv --dev_files …/cv-corpus-5.1-2020-06-22/en/clips/dev.csv --test_files …/cv-corpus-5.1-2020-06-22/en/clips/test.csv --train_batch_size 128 --test_batch_size 128 --dev_batch_size 128 --epochs 125 --n_hidden 2048 --learning_rate 0.0001 --dropout_rate 0.40 --export_dir …/exports/ --checkpoint_dir …/load_checkpoint/deepspeech-0.7.4-checkpoint/ --load_cudnn
The process starts fine, but it ends without any error after a certain number of steps. I don’t have any clue what is going wrong with it.
I have run the same process twice with slight modifications to learning_rate and epochs, the rest of the parameters being the same.
The first time it stopped after 183 steps, and the second time it stopped after 481 steps.
Please help if someone has faced a similar issue.
Thanks in advance.
othiele
(Olaf Thiele)
July 19, 2020, 1:21pm
2
Please post the log output; without anything to go on we can’t tell you much.
There is no error in the logs, @othiele.
The last thing I could see is: [image]
After that, nothing is logged in the file. I ran this command as a background process.
othiele
(Olaf Thiele)
July 19, 2020, 3:45pm
4
Please read how to post in the forums again. No images.
https://discourse.mozilla.org/t/what-and-how-to-report-if-you-need-support/62071/2
Run it on the console and in the foreground. I think I remember something like this. Either your OS stops it after some time, or you can search for “SIGINT” here. You would see the message if it ran in the foreground. Maybe @lissyx remembers that?
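If it helps, one way to find out how the process ended when nothing is logged is to launch it from a small wrapper and look at the exit status. This is only a minimal sketch and not part of DeepSpeech; `describe_exit` and `run_and_report` are hypothetical helper names. It relies on Python's documented POSIX behaviour that `subprocess` reports a negative return code when the child was terminated by a signal.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a human-readable message."""
    if returncode == 0:
        return "exited normally"
    if returncode < 0:
        # On POSIX, a negative code -N means the child was killed by signal N.
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with error code {returncode}"


def run_and_report(cmd):
    """Run cmd in the foreground, wait for it, and describe how it ended."""
    result = subprocess.run(cmd)
    message = describe_exit(result.returncode)
    print(message, file=sys.stderr)
    return message
```

Calling `run_and_report` with the training command as a list of arguments would print the outcome once training stops. A result such as `killed by signal 9 (SIGKILL)` would suggest that something outside the script stopped it, though the system logs are still needed to say what.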
baconator
(Bacon Ator)
July 21, 2020, 3:21am
5
Also check messages or syslog in /var/log/, and possibly dmesg.
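A rough sketch of that check in Python, assuming the kernel's usual wording for out-of-memory kills ("Out of memory", "oom-killer", "oom_reaper"). The helper names are hypothetical, which log file exists depends on the distribution, and reading these files may require root.

```python
import re

# Phrases the Linux kernel commonly writes when the OOM killer acts.
OOM_PATTERN = re.compile(r"out of memory|oom-killer|oom_reaper", re.IGNORECASE)


def find_oom_lines(lines):
    """Return the log lines that mention the kernel OOM killer."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


def scan_log(path):
    """Scan one log file; unreadable or missing files yield an empty list."""
    try:
        with open(path, errors="replace") as handle:
            return find_oom_lines(handle)
    except OSError:
        return []


def scan_default_logs(paths=("/var/log/syslog", "/var/log/messages")):
    """Map each candidate log path to the OOM-related lines found in it."""
    return {path: scan_log(path) for path in paths}
```

An empty result does not rule out an OOM kill, since older entries may have been rotated into other files.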
Yes, I checked kern.log, @baconator; it says OOM. Should I decrease the batch_size and run the training again?
baconator
(Bacon Ator)
July 21, 2020, 5:03am
7
Looks like you’re using a CPU? How much RAM does your system have? (If GPU, how much memory does it have?) How far to scale the batch sizes depends on your system, though; I’d try cutting it down and seeing if it works. Also, if you are training on CPU, you’re almost certainly going to want to switch to GPU for that volume of training.
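To make "cutting it down" concrete, here is a small sketch that lists batch sizes by repeated halving and builds the three batch-size flags from the original command. The helper names and the halving strategy are just assumptions for illustration; which size actually fits depends on the memory available.

```python
def halved_batch_sizes(start, minimum=1):
    """Return [start, start // 2, start // 4, ...] down to minimum."""
    if minimum < 1:
        raise ValueError("minimum must be at least 1")
    sizes = []
    size = start
    while size >= minimum:
        sizes.append(size)
        size //= 2
    return sizes


def batch_size_flags(size):
    """Build the three batch-size flags used in the original command."""
    return [
        "--train_batch_size", str(size),
        "--dev_batch_size", str(size),
        "--test_batch_size", str(size),
    ]
```

Starting from the 128 used above, the candidates would be tried one at a time, with the rest of the command left unchanged, until a run gets through without being killed.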