Missing distributed training in recent versions

Hi,

I am trying to train a DeepSpeech model across multiple machines with one GPU each. Until v0.4.0 there was support for training using distributed TensorFlow, but as of v0.5.0 this feature seems to be gone and I cannot find any documentation about it.

Is distributed training still supported, perhaps on a different branch? I remember reading in this forum that cluster-mode development had been put on hold in recent versions.

I would appreciate any guidance.

As far as I can recall, we removed it because it was blocking some improvements to the feeding process and it was not as efficient as we expected. I think @reuben can add more detail.

Exactly. We weren’t using it anymore, so it was a maintenance burden that slowed down development without providing any benefit.

So if I understand this correctly, there’s no alternative to the old distributed training.

And the expectation is that training on large datasets will run for a long time on a single machine.

Yep, in most larger setups you would have 4+ V100s in one machine, so you would need thousands of hours of data before training takes a really long time. And yes, people do let models train for weeks on smaller machines.
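
For anyone landing on this thread looking for the single-machine, multi-GPU route: below is a minimal, generic TensorFlow sketch of data parallelism with `tf.distribute.MirroredStrategy`. This is not the DeepSpeech training code (the training script handles multi-GPU in its own way); the model, shapes, and dummy data here are placeholders, just to illustrate the pattern of using all visible GPUs on one box instead of a cluster.

```python
import numpy as np
import tensorflow as tf

# Single-machine, multi-GPU data parallelism.
# MirroredStrategy replicates the model on every visible GPU
# (set CUDA_VISIBLE_DEVICES to restrict which ones are used).
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Placeholder model, not a speech model: 26 input features,
    # 29 output classes, purely for illustration.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(26,)),
        tf.keras.layers.Dense(29),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Dummy data so the sketch runs end to end; the global batch is
# split across replicas automatically.
x = np.random.rand(1024, 26).astype("float32")
y = np.random.randint(0, 29, size=(1024,))
model.fit(x, y, batch_size=256, epochs=1)
```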