Missing distributed training on latest versions


I am trying to train a DeepSpeech model across multiple machines with one GPU each. Up to v0.4.0 there was support for training with distributed TensorFlow, but as of v0.5.0 this feature seems to be gone, and I cannot find any documentation about it.

Is distributed training still supported, maybe on a different branch? I remember reading in this forum that cluster-mode development was held back in recent versions.

I would appreciate any guidance.

As far as I can recall, we removed it because it was blocking some improvements to the feeding process and it was not as efficient as we expected. I think @reuben can add more detail.

Exactly. We weren’t using it anymore, so it was a maintenance burden that slowed down development with no benefit.