Is distributed training across multiple machines still supported?

Hello,

I’m looking at training a DeepSpeech model across multiple machines (each machine has several GPUs). I looked at the docs and searched online, and found mentions of “run-cluster.sh”, ps_host, worker_hosts, and the like. However, none of these seem to be present in the current master branch.

Hopefully this is not a repeat question.

Thank you!!
Antoine

Sorry, this feature was dropped as it was too difficult to maintain. There was some discussion about it a couple of months ago here.


Ah, ok! Thank you for the info :slight_smile:

@aatallah What kind of use case do you have that would benefit so much from this? I’m curious.

@lissyx It is a typical approach in classical numerics, and I have seen other deep learning applications that also profit from this in their training phase.
If you have a cluster with only a few GPUs per node but fast interconnects between the nodes, it does not sound too bad.

Because DeepSpeech dropped maintenance of distributed training but we are still interested in it on our cluster at TU Dresden, I began writing code to use Horovod for distributed training.
You can give it a try on our GitHub.
I plan to do a PR when it’s ready, since the changes did not seem too dramatic.
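For reference, this is roughly the pattern I’m following; a minimal, self-contained sketch with a toy model and made-up names, not the actual DeepSpeech training graph:

```python
# Minimal sketch of Horovod data-parallel training with TF1-style graphs.
# Toy model and hypothetical names only -- not the DeepSpeech code.
import numpy as np
import horovod.tensorflow as hvd
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()
hvd.init()

# Pin each worker process to one local GPU, as Horovod recommends.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy regression model standing in for the acoustic model.
x = tf.placeholder(tf.float32, [None, 16])
y = tf.placeholder(tf.float32, [None, 1])
loss = tf.reduce_mean(tf.square(tf.layers.dense(x, 1) - y))

# Scale the learning rate by the number of workers and wrap the optimizer,
# so gradients are averaged across all ranks with an allreduce.
opt = tf.train.AdamOptimizer(1e-4 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

# Broadcast the initial variables from rank 0 so every worker starts identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    for _ in range(100):
        batch_x = np.random.rand(32, 16).astype(np.float32)
        batch_y = np.random.rand(32, 1).astype(np.float32)
        _, current_loss = sess.run([train_op, loss],
                                   feed_dict={x: batch_x, y: batch_y})
    if hvd.rank() == 0:
        print("final loss:", current_loss)
```

Launched with mpirun/horovodrun with one process per GPU, that is basically all the extra machinery needed.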


I have 6 machines, all on the same gigabit switch, and each has two GPUs :slight_smile: I want to make the most of the available resources if I can.

Thank you! Will give that a shot!

The question here is more related to:

  • training dataset size
  • setup complexity
  • hardware costs

Honestly, I train on more than 1000 h of audio in less than 18 h on a desktop at home with 2x RTX 2080 Ti.

Our cluster in Berlin is made of nodes with 8 GPUs in each.

It saw little usage, considering the setup described above, and it was hurting maintenance a lot.

I remember someone doing a PR about Horovod, but it was not very well integrated; the PR fell into limbo and we never heard back, so it’s good that someone is picking it up again.

I don’t think your switch will handle it. What GPUs do you have? FTR, PCIe monitoring on the RTX 2080 Ti here shows transfers of ~8 GB/s.

We looked at network-based interconnects at some point, but we concluded that to be efficient you actually need something in the 40 Gb networking class; even 10 GbE was not decent enough, and the extra GPU power would not really be used because of the data transfers.
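To put rough numbers on that (a back-of-the-envelope sketch only; the ~48M parameter count is my assumption for a DeepSpeech-sized model, and real allreduce implementations are smarter than a single full copy per step):

```python
# Back-of-the-envelope: time to move one full copy of the gradients per step.
# Assumes ~48M float32 parameters (an assumption, not a measured value) and
# ignores allreduce algorithms, overlap with compute, and compression.
params = 48e6
bytes_per_sync = params * 4  # float32 gradients: ~192 MB

links = {
    "1 GbE switch": 1e9 / 8,             # ~125 MB/s
    "10 GbE": 10e9 / 8,                  # ~1.25 GB/s
    "PCIe (observed ~8 GB/s)": 8e9,
}

for name, bandwidth in links.items():
    print(f"{name:>24}: ~{bytes_per_sync / bandwidth:.2f} s per gradient exchange")
```

Even in this optimistic version, a gigabit link spends on the order of a second per step just moving gradients, which is why the GPUs end up waiting on the network.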

I agree that a gigabit switch is much too slow to get real performance. Nevertheless, 100 Gb EDR InfiniBand is not unusual on HPC machines.

Of course, but again, given what we achieve with a few thousand hours and gamer-grade GPUs, that’s really why I’m curious to know why some people would really depend on this level of power for training; it does sound like overkill from my perspective. And I can’t imagine people forking and patching without good reasons. So I’m curious.

Of course, but again, given what we achieve with a few thousand hours and gamer-grade GPUs, that’s really why I’m curious to know why some people would really depend on this level of power for training; it does sound like overkill from my perspective.

I see plenty of reasons, but mostly I think it’s simply research interest. We do have an HPC machine and a few of these Power9 nodes. So the questions are: can we profit from it? Does it scale? If not, what do we need to do to make it scale? Can we get training down to 1 h? Or maybe only a few minutes?
We are looking mostly into performance, but our colleagues look more into automatic optimisation of AI models; they would profit from lower training times.

I remember someone doing a PR about Horovod, but it was not very well integrated; the PR fell into limbo and we never heard back, so it’s good that someone is picking it up again.

I think I saw the paper presentation for this PR a few years back. However, our tools only support MPI for parallel performance analysis, so we wanted to have a look at it again.

As said, the multi-node training pipeline was making it impossible to perform some improvements to the codebase, and there was no real use on our side, so it was poorly tested; hence it got removed.

If someone steps up to implement a robust solution and ensure it is supported, we would of course gladly welcome it.

Right.

What GPU, out of curiosity?

6x NVIDIA V100 per node.
Other information on this part of our cluster is available at
https://doc.zih.tu-dresden.de/hpc-wiki/bin/view/Compendium/Power9

So you could train on ~192 GPUs? What dataset size are you working on?

I totally understand that. We are still looking into it, but currently it looks like there are not too many changes needed, at least for a basic Horovod setup.

We are in the process of getting some of the new A100 as well.

Yes, we could. However, the question is: is it worth it? Or would we just get a speedup of, say, 8x? And another question: how does the distribution impact accuracy? That’s why we are looking into it.

Yes, this is what we can do in theory. Since it is a shared university cluster, not only for our performance group, you have to request resources, which are scheduled among other users’ jobs depending on available resources, the expected runtime of your job, … However, getting 24-36 GPUs is no big deal at the moment.
So we have to balance waiting time for resources against the time saved while running.

At the moment simply the Common Voice dataset in English, since we are at the beginning. Bigger (open) datasets are also welcome in our research.

You could try the new implementation in DeepSpeech-Polyglot. As it’s written in TensorFlow 2, you should be able to replace the distribution strategy (currently only multi-GPU) with a few lines of code and some extra setup on each node; a rough sketch is below.
But please note that the project is still very experimental and is missing some of DeepSpeech’s features.
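As a rough sketch of what I mean, with a toy Keras model instead of the real network and placeholder addresses in TF_CONFIG (so not the actual DeepSpeech-Polyglot code):

```python
# Sketch: swapping the single-node MirroredStrategy for a multi-worker one.
# Toy model and placeholder cluster config -- adapt to the real training script.
import numpy as np
import tensorflow as tf

# Current single-node, multi-GPU setup:
#   strategy = tf.distribute.MirroredStrategy()
# Multi-node setup: every worker process also needs TF_CONFIG set, e.g.
#   {"cluster": {"worker": ["node1:12345", "node2:12345"]},
#    "task": {"type": "worker", "index": 0}}
# (In older TF 2.x releases the class lives under tf.distribute.experimental.)
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Toy classifier standing in for the acoustic model.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(26,)),
        tf.keras.layers.Dense(29),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Dummy data; the real pipeline feeds audio features and transcripts.
features = np.random.rand(256, 26).astype("float32")
labels = np.random.randint(0, 29, size=(256,))
model.fit(features, labels, batch_size=32, epochs=1)
```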


I’m digging up this thread just to announce that distributed training (using Horovod) is now merged into DeepSpeech master.
For information on how to use it, take a look at doc/TRAINING.rst.
