Increasing --train_batch_size 2 to --train_batch_size 3 causes DeepSpeech not to train anymore. Why?

Specifically, if I run

./DeepSpeech.py --train_files data/common-voice-v1/cv-valid-train.csv \
  --dev_files data/common-voice-v1/cv-valid-dev.csv \
  --test_files data/common-voice-v1/cv-valid-test.csv \
  --log_level 0 --limit_train 10000 --train_batch_size 2 --train True

I get set_name: train:

D Starting queue runners...
D Queue runners started.
I STARTING Optimization
D step: 77263
D epoch: 61
D target epoch: 75
D steps per epoch: 1250
D number of batches in train set: 5000
D batches per job: 4
D batches per step: 4
D number of jobs in train set: 1250
D number of jobs already trained in first epoch: 1013
D Computing Job (ID: 2, worker: 0, epoch: 61, set_name: train)...
D Starting batch...
D Finished batch step 77264.
D Sending Job (ID: 2, worker: 0, epoch: 61, set_name: train)...
D Computing Job (ID: 3, worker: 0, epoch: 61, set_name: train)...
D Starting batch...
D Finished batch step 77265.
D Sending Job (ID: 3, worker: 0, epoch: 61, set_name: train)...
D Computing Job (ID: 4, worker: 0, epoch: 61, set_name: train)...
D Starting batch...
D Finished batch step 77266.
D Sending Job (ID: 4, worker: 0, epoch: 61, set_name: train)...
[...]

However, if I run:

./DeepSpeech.py --train_files data/common-voice-v1/cv-valid-train.csv \
  --dev_files data/common-voice-v1/cv-valid-dev.csv \
  --test_files data/common-voice-v1/cv-valid-test.csv \
  --log_level 0 --limit_train 10000 --train_batch_size 3 --train True

I get set_name: test:

D Starting queue runners...
D Queue runners started.
D step: 77263
D epoch: 92
D target epoch: 75
D steps per epoch: 833
D number of batches in train set: 3334
D batches per job: 4
D batches per step: 4
D number of jobs in train set: 833
D number of jobs already trained in first epoch: 627
D Computing Job (ID: 2, worker: 0, epoch: 92, set_name: test)...
D Starting batch...
D Finished batch step 77263.
D Sending Job (ID: 2, worker: 0, epoch: 92, set_name: test)...
D Computing Job (ID: 3, worker: 0, epoch: 92, set_name: test)...
D Starting batch...
D Finished batch step 77263.
D Sending Job (ID: 3, worker: 0, epoch: 92, set_name: test)...
D Computing Job (ID: 4, worker: 0, epoch: 92, set_name: test)...
D Starting batch...
D Finished batch step 77263.
D Sending Job (ID: 4, worker: 0, epoch: 92, set_name: test)...
D Computing Job (ID: 5, worker: 0, epoch: 92, set_name: test)...
D Starting batch...
D Finished batch step 77263.
D Sending Job (ID: 5, worker: 0, epoch: 92, set_name: test)...
[...]

I train Mozilla DeepSpeech on 4 Nvidia GeForce GTX 1080 GPUs.

I don’t see what can make you think it’s not training anymore. Batch size depends on the available memory and the dataset; with a GTX 1080 you can likely push it higher than 3.

“what can make you think it’s not training anymore”

I see right after starting the ./DeepSpeech.py [...] command:

  • set_name: train when --train_batch_size 2
  • set_name: test when --train_batch_size 3

This leads me to think that with --train_batch_size 3 it’s only testing, not training. Did I miss something?

Are you cleaning up the checkpoint directory? Training with a bigger batch size, if it doesn’t run OOM, should be faster, so it’s possible that the test step happens.

Looking at your logs, the epoch is also higher in the second one (92 vs 61): the step counter restored from the checkpoint (77263) is the same, but with batch size 3 there are fewer steps per epoch, so the same step maps to an epoch past your target of 75. That would be consistent.
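
Rough arithmetic with the numbers from your logs (the variable names are mine, just to show where the epoch values come from; this isn’t taken from DeepSpeech’s source):

limit_train = 10000      # --limit_train
batches_per_step = 4     # "batches per step: 4" in both logs
resumed_step = 77263     # "step: 77263", restored from the checkpoint
target_epoch = 75        # "target epoch: 75"

for batch_size in (2, 3):
    batches_in_train_set = -(-limit_train // batch_size)   # ceil(10000 / batch_size)
    steps_per_epoch = batches_in_train_set // batches_per_step
    current_epoch = resumed_step // steps_per_epoch
    phase = "train" if current_epoch < target_epoch else "test"
    print(f"batch_size={batch_size}: {steps_per_epoch} steps/epoch, epoch {current_epoch} -> {phase}")

# batch_size=2: 1250 steps/epoch, epoch 61 -> train  (matches your first log)
# batch_size=3: 833 steps/epoch, epoch 92 -> test    (matches your second log)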

Thanks, good catch. Cleaning up the checkpoint directory (default location on Ubuntu: /home/[username]/.local/share/deepspeech/checkpoints) fixed the issue; it now trains when --train_batch_size > 2.
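
In case it helps anyone else, the cleanup amounts to something like this (the path below is that default Ubuntu location; move the directory aside instead if you still want the old checkpoints):

# Deletes the stale DeepSpeech checkpoints so the next run starts from step 0.
# The path is the default Ubuntu location; adjust it if you pass --checkpoint_dir.
import shutil
from pathlib import Path

ckpt_dir = Path.home() / ".local/share/deepspeech/checkpoints"
if ckpt_dir.exists():
    shutil.rmtree(ckpt_dir)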

You can change the checkpoint location if you need to, with --checkpoint_dir.
