Division by zero when calculating loss

Hi, I am trying to train a new DeepSpeech model. It runs for 7 steps and then crashes; error logs below:

I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:00:06 | Steps: 7 | Loss: 467.228032
Epoch 0 | Validation | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000 | Dataset: /home/hangtg/Desktop/DeepSpeech/data_processing/vn/clips/dev.csv
Traceback (most recent call last):
  File "DeepSpeech.py", line 965, in <module>
    absl.app.run(main)
  File "/home/hangtg/Desktop/deepspech-env/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/hangtg/Desktop/deepspech-env/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 938, in main
    train()
  File "DeepSpeech.py", line 645, in train
    dev_loss = dev_loss / total_steps
ZeroDivisionError: float division by zero

As I dug into the DeepSpeech.py file, I found that total_loss, batch_loss, step_count, and total_steps don't increase as they should, and I don't know why; the CSV files load correctly.
So I tried initializing step_count to 1 instead of 0, and it "worked", kind of: batch_loss is still 0, so the validation loss is always 0.000.
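
If I strip train() down to the pattern around that line, I think it is something like this (my own sketch, not the actual DeepSpeech code; dev_batches and run_step are placeholder names):

dev_loss = 0.0
total_steps = 0
for batch in dev_batches:          # if this iterator yields nothing,
    dev_loss += run_step(batch)    # the body never runs, so the
    total_steps += 1               # counters never increase
dev_loss = dev_loss / total_steps  # ZeroDivisionError when total_steps == 0

So it looks like the validation loop never receives a single batch, which would also explain why forcing step_count to 1 only hides the error.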
After training completed, testing hit the same error, this time in evaluate_tools.py:
wer = sum(s.word_distance for s in samples) / sum(s.word_length for s in samples)
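
My guess is that samples ends up empty here, in which case both sums run over an empty list and the denominator is 0:

samples = []                             # nothing was evaluated
sum(s.word_distance for s in samples)    # 0
sum(s.word_length for s in samples)      # 0 -> float division by zero again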

What is wrong with my files?

Only you can tell us that, as we don’t have your files. Obviously, there is something wrong in your validation set and your test set …

Should I upload my CSV files?

You could start by giving more context: size, number of transcriptions, training parameters you use, etc …

I honestly don’t have time to debug your files, sorry.

I am starting with a very small set, just for testing. There are 60 records in test.csv, 60 in dev.csv, and 452 in train.csv.
audio_sample_rate: 16000
75 epochs
train, dev, and test batch sizes: 64
use_allow_growth: True
use_cudnn_rnn: False
no checkpoint loading, since I am creating a new model
automatic_mixed_precision: False
n_steps: 16
n_hidden: 2048

Anything else you need?

Batch size 64 with < 64 samples, here we go. Reduce it to make sure you have enough samples to fill a batch. Try 8.
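
Quick sanity check, assuming the input pipeline drops incomplete batches (which matches your symptoms):

dev_samples = 60
print(dev_samples // 64)  # 0 -> zero validation batches, total_steps stays 0
print(dev_samples // 8)   # 7 -> seven full batches, validation actually runs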


OH … MY … GOD!
I am so dumb! Thank you <3