Division by zero when calculating loss

Hi, I am trying to train a new DeepSpeech model. It runs for 7 steps and then crashes; error logs below:

I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:00:06 | Steps: 7 | Loss: 467.228032
Epoch 0 | Validation | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000 | Dataset: /home/hangtg/Desktop/DeepSpeech/data_processing/vn/clips/dev.csv
Traceback (most recent call last):
  File "DeepSpeech.py", line 965, in <module>
    absl.app.run(main)
  File "/home/hangtg/Desktop/deepspech-env/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/hangtg/Desktop/deepspech-env/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 938, in main
    train()
  File "DeepSpeech.py", line 645, in train
    dev_loss = dev_loss / total_steps
ZeroDivisionError: float division by zero

As I dug into the DeepSpeech.py file, I found that total_loss, batch_loss, step_count, and total_steps don't increase as they should, and I don't know why; the CSV files load correctly.
So I tried initializing step_count to 1 instead of 0, and it "worked", kind of: batch_loss is still 0, so the validation loss is always 0.000.
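
If I strip train() down to the pattern around that line, I think it is something like this (my own sketch, not the actual DeepSpeech code; dev_batches and run_step are placeholder names):

dev_loss = 0.0
total_steps = 0
for batch in dev_batches:          # if this iterator yields nothing,
    dev_loss += run_step(batch)    # the body never runs, so the
    total_steps += 1               # counters never increase
dev_loss = dev_loss / total_steps  # ZeroDivisionError when total_steps == 0

So it looks like the validation loop never receives a single batch, which would also explain why forcing step_count to 1 only hides the error.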
After training completed, testing hit the same error, this time in evaluate_tools.py:
wer = sum(s.word_distance for s in samples) / sum(s.word_length for s in samples)
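
My guess is that samples ends up empty here, in which case both sums run over an empty list and the denominator is 0:

samples = []                             # nothing was evaluated
sum(s.word_distance for s in samples)    # 0
sum(s.word_length for s in samples)      # 0 -> float division by zero again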

What is wrong with my files?

Only you can tell us that, as we don’t have your files. Obviously, there is something wrong in your validation set and your test set …

Should I upload my CSV files?

You could start by giving more context: size, number of transcriptions, training parameters you use, etc …

I honestly don’t have time to debug your files, sorry.

I am starting with a very small set, just for testing. There are 60 records in test.csv, 60 in dev.csv, and 452 in train.csv.
audio_sample_rate: 16000
75 epochs
train, dev, and test batch sizes: 64
use_allow_growth: True
use_cudnn_rnn: False
no checkpoint loading, since I am creating a new model
automatic_mixed_precision: False
n_steps: 16
n_hidden: 2048

Anything else you need?

Batch size 64 with < 64 samples, here we go. Reduce it to make sure you have enough samples to fill a batch. Try 8.
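
Quick sanity check, assuming the input pipeline drops incomplete batches (which matches your symptoms):

dev_samples = 60
print(dev_samples // 64)  # 0 -> zero validation batches, total_steps stays 0
print(dev_samples // 8)   # 7 -> seven full batches, validation actually runs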


OH … MY … GOD!
I am so dumb! Thank you <3