Increasing --train_batch_size 2 to --train_batch_size 3 causes DeepSpeech not to train anymore. Why?

Specifically, if I run

./DeepSpeech.py --train_files data/common-voice-v1/cv-valid-train.csv \
  --dev_files data/common-voice-v1/cv-valid-dev.csv \
  --test_files data/common-voice-v1/cv-valid-test.csv \
  --log_level 0 --limit_train 10000 --train_batch_size 2 --train True

I get set_name: train:

D Starting queue runners...
D Queue runners started.
I STARTING Optimization
D step: 77263
D epoch: 61
D target epoch: 75
D steps per epoch: 1250
D number of batches in train set: 5000
D batches per job: 4
D batches per step: 4
D number of jobs in train set: 1250
D number of jobs already trained in first epoch: 1013
D Computing Job (ID: 2, worker: 0, epoch: 61, set_name: train)...
D Starting batch...
D Finished batch step 77264.
D Sending Job (ID: 2, worker: 0, epoch: 61, set_name: train)...
D Computing Job (ID: 3, worker: 0, epoch: 61, set_name: train)...
D Starting batch...
D Finished batch step 77265.
D Sending Job (ID: 3, worker: 0, epoch: 61, set_name: train)...
D Computing Job (ID: 4, worker: 0, epoch: 61, set_name: train)...
D Starting batch...
D Finished batch step 77266.
D Sending Job (ID: 4, worker: 0, epoch: 61, set_name: train)...
[...]

However, if I run:

./DeepSpeech.py --train_files data/common-voice-v1/cv-valid-train.csv \
  --dev_files data/common-voice-v1/cv-valid-dev.csv \
  --test_files data/common-voice-v1/cv-valid-test.csv \
  --log_level 0 --limit_train 10000 --train_batch_size 3 --train True

I get set_name: test:

D Starting queue runners...
D Queue runners started.
D step: 77263
D epoch: 92
D target epoch: 75
D steps per epoch: 833
D number of batches in train set: 3334
D batches per job: 4
D batches per step: 4
D number of jobs in train set: 833
D number of jobs already trained in first epoch: 627
D Computing Job (ID: 2, worker: 0, epoch: 92, set_name: test)...
D Starting batch...
D Finished batch step 77263.
D Sending Job (ID: 2, worker: 0, epoch: 92, set_name: test)...
D Computing Job (ID: 3, worker: 0, epoch: 92, set_name: test)...
D Starting batch...
D Finished batch step 77263.
D Sending Job (ID: 3, worker: 0, epoch: 92, set_name: test)...
D Computing Job (ID: 4, worker: 0, epoch: 92, set_name: test)...
D Starting batch...
D Finished batch step 77263.
D Sending Job (ID: 4, worker: 0, epoch: 92, set_name: test)...
D Computing Job (ID: 5, worker: 0, epoch: 92, set_name: test)...
D Starting batch...
D Finished batch step 77263.
D Sending Job (ID: 5, worker: 0, epoch: 92, set_name: test)...
[...]

I train Mozilla DeepSpeech on 4 Nvidia GeForce GTX 1080 GPUs.

I don’t see what can make you think it’s not training anymore. Batch size depends on the available memory and the dataset; with a GTX 1080 you can likely push it higher than 3.

“what can make you think it’s not training anymore”

I see right after starting the ./DeepSpeech.py [...] command:

  • set_name: train when --train_batch_size 2
  • set_name: test when --train_batch_size 3

This leads me to think that with --train_batch_size 3 it’s only testing, not training. Did I miss something?

Are you cleaning up the checkpoint directory? Training with a bigger batch size, if it doesn’t run OOM, should be faster, so it’s possible that the test step happens.

Looking at your logs, the epoch is also higher in the second one (92 vs 61): the step counter restored from the checkpoint (77263) is the same, but with batch size 3 there are fewer steps per epoch, so the same step maps to an epoch past your target of 75. That would be consistent.
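
Rough arithmetic with the numbers from your logs (the variable names are mine, just to show where the epoch values come from; this isn’t taken from DeepSpeech’s source):

limit_train = 10000      # --limit_train
batches_per_step = 4     # "batches per step: 4" in both logs
resumed_step = 77263     # "step: 77263", restored from the checkpoint
target_epoch = 75        # "target epoch: 75"

for batch_size in (2, 3):
    batches_in_train_set = -(-limit_train // batch_size)   # ceil(10000 / batch_size)
    steps_per_epoch = batches_in_train_set // batches_per_step
    current_epoch = resumed_step // steps_per_epoch
    phase = "train" if current_epoch < target_epoch else "test"
    print(f"batch_size={batch_size}: {steps_per_epoch} steps/epoch, epoch {current_epoch} -> {phase}")

# batch_size=2: 1250 steps/epoch, epoch 61 -> train  (matches your first log)
# batch_size=3: 833 steps/epoch, epoch 92 -> test    (matches your second log)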

Thanks, good catch. Cleaning up the checkpoint directory (default location on Ubuntu: /home/[username]/.local/share/deepspeech/checkpoints) fixed the issue; it now trains when --train_batch_size > 2.
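
In case it helps anyone else, the cleanup amounts to something like this (the path below is that default Ubuntu location; move the directory aside instead if you still want the old checkpoints):

# Deletes the stale DeepSpeech checkpoints so the next run starts from step 0.
# The path is the default Ubuntu location; adjust it if you pass --checkpoint_dir.
import shutil
from pathlib import Path

ckpt_dir = Path.home() / ".local/share/deepspeech/checkpoints"
if ckpt_dir.exists():
    shutil.rmtree(ckpt_dir)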

You can change the checkpoint location if you need to, with --checkpoint_dir.
