Incremental training or all data at once?

Suppose I want to train a DeepSpeech2 model on 10,000 hours of data. Is it better to train the model incrementally (~2,000 hours of data at a time) or on all of the data at once? What are the advantages and disadvantages of each approach?

I am not sure what the advantage of incremental training would be. You would be overfitting to each dataset at each training stage. With all of the data at once you should be able to generalise better.

I ran some tests with two smaller datasets (35 h and 180 h). With the bigger one I saw a small improvement, but not with the smaller one.
You can find the script I used here: https://github.com/DanBmh/deepspeech-german/blob/master/training/cycled_training.py
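For anyone who just wants the idea without reading the full script, here is a minimal sketch of cycled/incremental training: split the corpus into chunks and resume each round from the previous round's checkpoint. This is not the actual script; the chunk paths are placeholders, and the flags follow Mozilla DeepSpeech's `DeepSpeech.py`, so check them against the training entry point you actually use:

```python
# Hypothetical sketch of cycled training: each round resumes from the
# same checkpoint directory, so the model carries over between chunks.
import subprocess

chunks = ["chunk_01.csv", "chunk_02.csv", "chunk_03.csv"]  # placeholder paths, ~2,000 h each
checkpoint_dir = "checkpoints/cycled"

for chunk in chunks:
    subprocess.run(
        [
            "python", "DeepSpeech.py",
            "--train_files", chunk,
            "--checkpoint_dir", checkpoint_dir,  # reused, so training resumes each round
            "--epochs", "10",
        ],
        check=True,
    )
```

Training on all of the data at once would be a single such call with every chunk listed in `--train_files`.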

There also wasn’t a great improvement using the 180 h checkpoint versus transfer learning from the English checkpoint when training on German Common Voice (500 h).
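The transfer-learning alternative looks roughly like the sketch below: load the released English checkpoint and re-initialise the output layer, since the German alphabet differs. The flag names follow Mozilla DeepSpeech 0.7+ and the file paths are placeholders, so verify both against your setup:

```python
# Hypothetical sketch of transfer learning from the English checkpoint.
import subprocess

subprocess.run(
    [
        "python", "DeepSpeech.py",
        "--train_files", "german_common_voice_train.csv",  # placeholder path
        "--alphabet_config_path", "alphabet_de.txt",       # placeholder path
        "--load_checkpoint_dir", "checkpoints/english",    # released English model
        "--save_checkpoint_dir", "checkpoints/german",
        "--drop_source_layers", "1",  # drop the English output layer (different alphabet)
        "--epochs", "10",
    ],
    check=True,
)
```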

So incremental training may improve your results, but in my opinion it’s not worth the extra training time.