`absl.app.run` does not return

When running DeepSpeech’s training on Mozilla’s Common Voice with

python3 DeepSpeech.py \
    --train_files $CLIPDIR/train.csv \
    --alphabet_config_path ./data/alphabet.txt \
    --checkpoint_dir ${OUTDIR}/checkpoints \
    --train_batch_size 128 \
    --show_progressbar true \
    --learning_rate 0.000001 \
    --checkpoint_secs 60 \
    --summary_dir ${OUTDIR}/tensorboard \
    --epochs 10 \
    --limit_train 500 \
    --log_level 0 \
    --n_hidden 1250 \
    --early_stop true \
    --dropout_rate 0.05 \
    --dropout_rate2 0.05 \
    --dropout_rate3 0.05 \
    --dropout_rate4 0 \
    --dropout_rate5 0 \
    --dropout_rate6 0.05 \
    --es_epochs 5 \
    --train_cudnn

The DeepSpeech process does not terminate.
I get the messages FINISHED optimization in XXX sec and Session closed, but then nothing happens; the process seems to freeze.
I tried some simple print-debugging: a print statement at the end of main in train.py is executed, but another print statement after absl.app.run(main) in train.py is not. Therefore, I think absl.app.run somehow does not return.
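For what it's worth, absl.app.run is documented to end the process via sys.exit rather than return to its caller, so a print after the call would never execute even on a clean run. The sketch below models that behaviour with a simplified stand-in for app.run (so it doesn't depend on absl being installed); run and main here are illustrative, not DeepSpeech code.

```python
import sys

def run(main, argv=None):
    # Simplified model of absl.app.run: even on success it does not
    # return to the caller -- it forwards main's return value to
    # sys.exit, which raises SystemExit.
    sys.exit(main(argv if argv is not None else sys.argv))

def main(argv):
    print("I FINISHED optimization")
    return 0  # becomes the process exit status

try:
    run(main, ["DeepSpeech.py"])
    print("never reached")  # SystemExit propagates past this line
except SystemExit as exc:
    print("exit status:", exc.code)
```

So a process that hangs after "Session closed" points less at app.run itself and more at something, e.g. a non-daemon thread, keeping the interpreter alive during shutdown.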

My setup is
TensorFlow 1.15.0 with GPU support
DeepSpeech 0.9.1

Because of my overall environment setup, I need DeepSpeech to exit with status 0 when training is finished.
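One way to investigate a process that refuses to die after main returns is to list the non-daemon threads still alive, since those block interpreter shutdown. A hedged sketch; lingering_threads is my own helper name, not DeepSpeech or absl API:

```python
import threading

def lingering_threads():
    """Return non-daemon threads (other than the main thread) that
    would keep the interpreter alive after main() returns."""
    return [t for t in threading.enumerate()
            if t is not threading.main_thread() and not t.daemon]

# If this list is non-empty at the end of main, that explains the hang.
```

If a library thread really cannot be joined, calling os._exit(0) at the very end of main forces an immediate exit with status 0, which satisfies a batch scheduler, but note it skips atexit handlers and buffered-output flushing.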

Since requirements.txt stipulates 1.15.4, I'm assuming you have not properly followed the documented setup steps?

I followed the documented steps. However, I have to use TensorFlow 1.15.0 because of my infrastructure, so I used the option DS_NOTENSORFLOW to use my own TensorFlow setup.
Checking the changes from 1.15.0 to 1.15.4, the diff does not seem that big and does not seem to touch Abseil.

Does it run completely for just 200 files in train/dev/test? That would show that it runs in general. We have had users reporting strange problems that were due to the server environment or the amount of data. Check CUDA and cuDNN.

There’s just a fix for cuDNN …

Care to share details? We’ve seen long startup times because of GPUs in some cases. Namely on EC2, I’ve seen it take minutes.

To ease debugging, it would have been nice if you had provided the pip list output, as well as verified the steps you followed (virtualenv setup? Python version?).

According to the output, the number of epochs I specified is trained, with a decreasing loss.

This is likely what’s causing your issues. If you can reproduce it with TensorFlow 1.15.4 installed with pip (not conda), as we document, then this is something else; but until you eliminate that question, it’s a waste of time to keep looking at other things.


I’m running it on my university’s HPC cluster within a software toolchain and SLURM batch scheduling.

✗ pip list
Package              Version      Location                                                          
-------------------- ------------ ------------------------------------------------------------------
absl-py              0.8.1        
alembic              1.4.3        
appdirs              1.4.4        
astor                0.8.0        
attrdict             2.0.1        
attrs                20.3.0       
audioread            2.1.9        
beautifulsoup4       4.9.3        
bs4                  0.0.1        
certifi              2020.11.8    
cffi                 1.14.4       
chardet              3.0.4        
cliff                3.5.0        
cmaes                0.7.0        
cmd2                 1.4.0        
colorama             0.4.4        
colorlog             4.6.2        
decorator            4.4.2        
deepspeech-training  0.9.1       /path/to/DeepSpeech/training
ds-ctcdecoder        0.9.1        
gast                 0.2.2        
google-pasta         0.1.8        
grpcio               1.25.0       
h5py                 2.10.0       
idna                 2.10         
importlib-metadata   3.1.1        
joblib               0.17.0       
Keras-Applications   1.0.8        
Keras-Preprocessing  1.1.0        
librosa              0.8.0        
llvmlite             0.31.0       
Mako                 1.1.3        
Markdown             3.1.1        
MarkupSafe           1.1.1        
mpi4py               3.0.2        
mpmath               1.1.0        
numba                0.47.0       
numpy                1.17.3       
opt-einsum           3.1.0        
optuna               2.3.0        
opuslib              2.0.0        
packaging            20.7         
pandas               0.25.3       
pbr                  5.5.1        
pip                  19.0.3       
pooch                1.3.0        
prettytable          0.7.2        
progressbar2         3.53.1       
protobuf             3.10.0       
pycparser            2.20         
pyparsing            2.4.7        
pyperclip            1.8.1        
python-dateutil      2.8.1        
python-editor        1.0.4        
python-utils         2.4.0        
pytz                 2020.4       
pyxdg                0.27         
PyYAML               5.3.1        
requests             2.25.0       
resampy              0.2.2        
scikit-learn         0.23.2       
scipy                1.3.1        
scorep               3.0          
semver               2.13.0       
setuptools           40.8.0       
six                  1.15.0       
SoundFile            0.10.3.post1 
soupsieve            2.0.1        
sox                  1.4.1        
SQLAlchemy           1.3.20       
stevedore            3.3.0        
tensorboard          1.15.0       
tensorflow           1.15.0       
tensorflow-estimator 1.15.1       
termcolor            1.1.0        
threadpoolctl        2.1.0        
tqdm                 4.54.0       
urllib3              1.26.2       
wcwidth              0.2.5        
Werkzeug             0.16.0       
wrapt                1.11.2       
zipp                 3.4.0        

✗ python --version
Python 3.7.4

I’m using Python’s virtualenv, but TensorFlow comes from “outside”, provided by Lmod.

TensorFlow from outside the venv could clash in weird ways with absl; this is something I have already seen.

To get some useful further support, I tried DeepSpeech with TensorFlow (without GPU) provided by pip:

✗ pip list
Package              Version      Location
-------------------- ------------ ------------------------------------------------------------------
absl-py              0.11.0
alembic              1.4.3
appdirs              1.4.4
astor                0.8.1
attrdict             2.0.1
attrs                20.3.0
audioread            2.1.9
beautifulsoup4       4.9.3
bs4                  0.0.1
cached-property      1.5.2
certifi              2020.12.5
cffi                 1.14.4
chardet              3.0.4
cliff                3.5.0
cmaes                0.7.0
cmd2                 1.4.0
colorama             0.4.4
colorlog             4.6.2
decorator            4.4.2
deepspeech-training  0.9.2        /path/to/ds/training
ds-ctcdecoder        0.9.2
gast                 0.2.2
google-pasta         0.2.0
grpcio               1.34.0
h5py                 3.1.0
idna                 2.10
importlib-metadata   3.1.1
joblib               0.17.0
Keras-Applications   1.0.8
Keras-Preprocessing  1.1.2
librosa              0.8.0
llvmlite             0.31.0
Mako                 1.1.3
Markdown             3.3.3
MarkupSafe           1.1.1
numba                0.47.0
numpy                1.19.4
opt-einsum           3.3.0
optuna               2.3.0
opuslib              2.0.0
packaging            20.7
pandas               1.1.5
pbr                  5.5.1
pip                  20.2.2
pooch                1.3.0
prettytable          0.7.2
progressbar2         3.53.1
protobuf             3.14.0
pycparser            2.20
pyparsing            2.4.7
pyperclip            1.8.1
python-dateutil      2.8.1
python-editor        1.0.4
python-utils         2.4.0
pytz                 2020.4
pyxdg                0.27
PyYAML               5.3.1
requests             2.25.0
resampy              0.2.2
scikit-learn         0.23.2
scipy                1.5.4
semver               2.13.0
setuptools           49.6.0
six                  1.15.0
SoundFile            0.10.3.post1
soupsieve            2.0.1
sox                  1.4.1
SQLAlchemy           1.3.20
stevedore            3.3.0
tensorboard          1.15.0
tensorflow           1.15.4
tensorflow-estimator 1.15.1
termcolor            1.1.0
threadpoolctl        2.1.0
tqdm                 4.54.1
urllib3              1.26.2
wcwidth              0.2.5
Werkzeug             1.0.1
wheel                0.34.2
wrapt                1.12.1
zipp                 3.4.0

The problem does not change. It hangs with

--------------------------------------------------------------------------------
I FINISHED optimization in 0:00:44.761575
D Session closed.

after

python3 -u DeepSpeech.py \
    --train_files $CLIPDIR/train.csv \
    --dev_files $CLIPDIR/dev.csv \
    --limit_train 100 \
    --limit_dev 50 \
    --checkpoint_dir ${OUTDIR}/checkpoints \
    --train_batch_size 128 \
    --learning_rate 0.000001 \
    --summary_dir ${OUTDIR}/tensorboard \
    --epochs 1 \
    --n_hidden 1250 \
    --log_level 0

I don’t even know where that comes from, sorry. You seem to be using a specific setup; I’m not sure what happens there.

As @reuben said, we can’t do more until we eliminate tensorflow/cuda from the equation.

I’m sorry but at that point, it’s not clear whether you have fully populated a virtualenv as documented or not.

This came from train.py, line 678.

As I stated here, I did create a new, empty virtualenv with Python 3.7.4 and only did a pip install -e ., which resulted in installing tensorflow 1.15.4 provided by PyPI. This should eliminate tensorflow/cuda.
Moreover, I checked out DeepSpeech 0.9.2.

Looks like you started training with a limit; don’t. Create a small set of 200 or so files and let it run through with a test set. To me it looks like training is going fine; it doesn’t look like you have a problem there, but run a test set.

Using the LDC93S1 dataset seems to work. So this seems to be a bug in DeepSpeech’s --limit_train. Or are there general reasons not to use this option? Better asked: what is this option intended to be used for?

Please understand options before you use them. This option simply stops after a certain number of WAVs. It can be used in combination with reverse to identify bad audio files that can’t otherwise be found …

So @lissyx was right. It isn’t DeepSpeech but your special setup that you can’t change.

Then the documentation seems to be a bit misleading. python DeepSpeech.py --helpfull only says

–limit_train: maximum number of elements to use from train set - 0 means no limit
(default: ‘0’)
(an integer)

and the old GitHub issue #2777 leads to “this is a valid way to reduce your datasets without changing the corresponding csv”.

Therefore, I do not understand why DeepSpeech should only stop and not terminate when using this option, even if it is meant for debugging corrupted datasets, as you mentioned.
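For illustration only (this is not DeepSpeech’s actual implementation), a limit option of this kind typically just truncates the in-memory view of the training CSV, leaving the file itself untouched; load_rows below is a hypothetical helper, and the column names follow the Common Voice importer convention:

```python
import csv
import io

# Hypothetical train.csv with 500 entries.
train_csv = "wav_filename,wav_filesize,transcript\n" + "\n".join(
    f"clips/sample{i}.wav,16000,hello world" for i in range(500))

def load_rows(text, limit=0):
    # limit == 0 means "no limit", matching the --limit_train help text
    rows = list(csv.DictReader(io.StringIO(text)))
    return rows if limit == 0 else rows[:limit]

print(len(load_rows(train_csv, limit=100)))  # 100
print(len(load_rows(train_csv)))             # 500
```

On this reading, the option only controls how much data is fed to training; whether the process then exits cleanly is a separate question.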

Yes, it can be read that way, but it is hard to make it unambiguous with the amount of time we have. If you have a good alternative for naming these options, we are happy to get PRs.

For a normal training setup, search this forum and you’ll find many good examples. All the best for getting the HPC cluster up and running, if possible.

The problem seems to be solved this way. Thanks for your help.
I will think about a better naming.