DeepSpeech process does not terminate
I get the messages `FINISHED optimization in XXX sec` and `Session closed`, but then nothing happens. The process seems to freeze.
I tried some simple print-debugging: a print statement at the end of `main` in train.py is executed, but another print statement after `absl.app.run(main)` in train.py is not. Therefore, I think `absl.app.run` somehow does not return.
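For what it's worth, `absl.app.run(main)` is not expected to return to the caller even in a healthy run: it ends by calling `sys.exit()` with `main`'s return value, so a print placed after it never executes. A minimal sketch of that behavior, with a hypothetical `fake_app_run` standing in for `absl.app.run`:

```python
import sys

def fake_app_run(main):
    # Simplified stand-in for absl.app.run (hypothetical name, for
    # illustration): the real absl.app.run also finishes by calling
    # sys.exit(main(argv)), raising SystemExit instead of returning.
    sys.exit(main(sys.argv))

def main(argv):
    print("end of main")  # a print placed here does run
    return 0

try:
    fake_app_run(main)
    print("after app.run")  # unreachable: fake_app_run raises SystemExit
except SystemExit as e:
    exit_code = e.code  # 0 when main returned 0
```

So a clean exit with status 0 still looks like "app.run never returned"; the freeze itself has to be something keeping `sys.exit` from completing.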
My setup is
TensorFlow 1.15.0 with GPU support
DeepSpeech 0.9.2
Because of my whole environment setup I need DeepSpeech to exit with 0 when training is finished.
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
Since requirements.txt stipulates 1.15.4, I'm assuming you have not properly followed the documented setup steps?
I followed the documented steps. However, I have to use TensorFlow 1.15.0 because of my infrastructure, so I used the DS_NOTENSORFLOW option to use my own TensorFlow setup.
Checking the changes from 1.15.0 to 1.15.4, the diff does not seem that big and does not seem to include abseil.
Does it run completely for just 200 files in train/dev/test? That would show that it runs in general. We have had users reporting strange problems that were due to the server environment or the amount of data. Check CUDA and cuDNN.
lissyx
there’s just a fix with cudnn …
Care to share details? We've seen long startup times because of GPUs in some cases. Namely, on EC2 I've seen it take minutes.
lissyx
To ease debugging, it would have been nice if you had provided the `pip list` output and verified the steps you followed (virtualenv setup? Python version?).
This is likely what's causing your issues. If you can reproduce with TensorFlow 1.15.4 installed with pip, not conda, as we document, then this is something else, but until you eliminate that question it's a waste of time to keep looking at other things.
As I stated here, I did create a new, empty virtualenv with Python 3.7.4 and only ran `pip install -e .`, which resulted in installing TensorFlow 1.15.4 from PyPI. This should eliminate TensorFlow/CUDA as the cause.
Moreover, I checked out DeepSpeech 0.9.2.
Looks like you started training with a limit, don't. Create a small set of 200 or so files and let it run through with a test set. To me it looks like training is going great; it doesn't look like you have a problem there, but run a test set.
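If it helps, a small subset CSV can be produced with a few lines of Python instead of the limit flag. The helper name below is just for illustration; the column layout follows the DeepSpeech training CSV format:

```python
import csv

def make_subset(src_csv, dst_csv, n=200):
    """Copy the first n rows of a training CSV into a new, smaller CSV."""
    with open(src_csv, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames  # e.g. wav_filename, wav_filesize, transcript
        rows = list(reader)[:n]
    with open(dst_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Then point `--train_files` (and likewise dev/test) at the subset file for the sanity-check run.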
Using the LDC93S1 dataset seems to work. So is this a bug in DeepSpeech's --limit_train, or are there general reasons not to use this option? Or, better asked: what is this option intended to be used for?
Please understand options before you use them. This option simply stops after a certain number of WAVs. It can be used in combination with the reverse option to identify bad audio files that can't otherwise be found.
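In other words, the selection logic is roughly the following (a simplified sketch of the idea, not the actual DeepSpeech code):

```python
def select_clips(clips, limit=0, reverse=False):
    # Mimics the idea behind --limit_train (cap the number of clips taken
    # from the set) and the reverse option (walk the set from the end).
    # Combined, they let you bisect which audio file breaks training.
    ordered = list(reversed(clips)) if reverse else list(clips)
    return ordered[:limit] if limit > 0 else ordered
```

Running once forward-limited and once reverse-limited narrows down where in the set a corrupt file sits.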
So @lissyx was right. It isn't DeepSpeech, but your special setup that you can't change.
Then the documentation seems a bit misleading. `python DeepSpeech.py --helpfull` only says:
--limit_train: maximum number of elements to use from train set - 0 means no limit
(default: '0')
(an integer)
and an old GitHub issue (#2777) leads to "this is a valid way to reduce your datasets without changing the corresponding csv".
Therefore, I do not understand why DeepSpeech should only stop and not terminate when using this option, even if it is meant for debugging corrupted datasets as you mentioned.
Yes, it can be read that way, but it is hard to make it unambiguous with the amount of time we have. If you have a good alternative for naming these options, we are happy to get PRs.
For normal training, search this forum and you'll find many good examples. All the best for getting the HPC cluster up and running, if possible.