Is distributed training across multiple machines still supported?

Hello,

I’m looking at training a DeepSpeech model across multiple machines (each machine has several GPUs). I looked at the docs and searched online, and found mentions of “run-cluster.sh”, ps_host, worker_hosts, and the like. However, none of these seem to be present in the current master branch.

Hopefully this is not a repeat question.

Thank you!!
Antoine

Sorry, this feature was dropped as it was too difficult to maintain. There was some discussion about it a couple of months ago here.


Ah, ok! Thank you for the info :slight_smile:

@aatallah What kind of use case do you have that would benefit so much from this? I’m curious.

@lissyx It is a typical approach in classical numerics, and I have seen other deep learning applications that also profit from this in their training phase.
If you have a cluster with only a few GPUs per node but fast interconnects between the nodes, it does not sound too bad.

Because DeepSpeech dropped maintenance of distributed training but we are still interested in it on our cluster at TU Dresden, I began writing code to use Horovod for distributed training.
You can give it a try on our GitHub.
I plan to do a PR when it’s ready, since the changes did not seem too dramatic.
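For reference, this is roughly the pattern I’m following; a minimal, self-contained sketch with a toy model and made-up names, not the actual DeepSpeech training graph:

```python
# Minimal sketch of Horovod data-parallel training with TF1-style graphs.
# Toy model and hypothetical names only -- not the DeepSpeech code.
import numpy as np
import horovod.tensorflow as hvd
import tensorflow.compat.v1 as tf

tf.disable_v2_behavior()
hvd.init()

# Pin each worker process to one local GPU, as Horovod recommends.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Toy regression model standing in for the acoustic model.
x = tf.placeholder(tf.float32, [None, 16])
y = tf.placeholder(tf.float32, [None, 1])
loss = tf.reduce_mean(tf.square(tf.layers.dense(x, 1) - y))

# Scale the learning rate by the number of workers and wrap the optimizer,
# so gradients are averaged across all ranks with an allreduce.
opt = tf.train.AdamOptimizer(1e-4 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
train_op = opt.minimize(loss)

# Broadcast the initial variables from rank 0 so every worker starts identically.
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

with tf.train.MonitoredTrainingSession(hooks=hooks, config=config) as sess:
    for _ in range(100):
        batch_x = np.random.rand(32, 16).astype(np.float32)
        batch_y = np.random.rand(32, 1).astype(np.float32)
        _, current_loss = sess.run([train_op, loss],
                                   feed_dict={x: batch_x, y: batch_y})
    if hvd.rank() == 0:
        print("final loss:", current_loss)
```

Launched with mpirun/horovodrun with one process per GPU, that is basically all the extra machinery needed.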


I have 6 machines, all on the same gigabit switch, and each has two GPUs :slight_smile: I want to make the most of the available resources if I can.

Thank you! Will give that a shot!

The question here is more related to:

  • training dataset size
  • setup complexity
  • hardware costs

Honestly, I train on more than 1000 h of audio in less than 18 h on a desktop at home with 2x RTX 2080 Ti.

Our cluster in Berlin is made of nodes with 8 GPUs in each.

It saw little usage, considering the setup described above, and it was hurting maintenance a lot.

I remember someone doing a PR about Horovod, but it was not very well integrated; the PR fell into limbo and we never heard back, so it’s good that someone is picking it up again.

I don’t think your switch will handle it. What GPUs do you have? FTR, PCIe monitoring on the RTX 2080 Ti here shows transfers of ~8 GB/s.

We looked at network-based interconnects at some point, but we concluded that to be efficient you actually need something in the 40 Gb networking class; even 10 GbE was not decent enough, and the extra GPU power would not really be used because of the data transfers.
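To put rough numbers on that (a back-of-the-envelope sketch only; the ~48M parameter count is my assumption for a DeepSpeech-sized model, and real allreduce implementations are smarter than a single full copy per step):

```python
# Back-of-the-envelope: time to move one full copy of the gradients per step.
# Assumes ~48M float32 parameters (an assumption, not a measured value) and
# ignores allreduce algorithms, overlap with compute, and compression.
params = 48e6
bytes_per_sync = params * 4  # float32 gradients: ~192 MB

links = {
    "1 GbE switch": 1e9 / 8,             # ~125 MB/s
    "10 GbE": 10e9 / 8,                  # ~1.25 GB/s
    "PCIe (observed ~8 GB/s)": 8e9,
}

for name, bandwidth in links.items():
    print(f"{name:>24}: ~{bytes_per_sync / bandwidth:.2f} s per gradient exchange")
```

Even in this optimistic version, a gigabit link spends on the order of a second per step just moving gradients, which is why the GPUs end up waiting on the network.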

I agree that a gigabit switch is much too slow to get real performance. Nevertheless, 100 Gb EDR InfiniBand is not unusual on HPC machines.

Of course, but again, given what we achieve with a few thousand hours and gamer-grade GPUs, that’s really why I’m curious to know why some people would really depend on this level of power for training; it does sound like overkill from my perspective. And I can’t imagine people forking and patching without good reasons. So I’m curious.

Of course, but again, given what we achieve with a few thousand hours and gamer-grade GPUs, that’s really why I’m curious to know why some people would really depend on this level of power for training; it does sound like overkill from my perspective.

I see plenty of reasons, but mostly I think it’s simply research interest. We do have an HPC machine and a few of these Power9 nodes. So the questions are: can we profit from it? Does it scale? If not, what do we need to do to make it scale? Can we get training down to 1 h? Or maybe only a few minutes?
We are looking mostly into performance, but our colleagues look more into automatic optimisation of AI models; they would profit from lower training times.

I remember someone doing a PR about Horovod, but it was not very well integrated; the PR fell into limbo and we never heard back, so it’s good that someone is picking it up again.

I think I saw the paper presentation for this PR a few years back. However, our tools only support MPI for parallel performance analysis, so we wanted to have a look at it again.

As said, the multi-node training pipeline was making it impossible to perform some improvements to the codebase, and there was no real use on our side, so it was poorly tested; hence it got removed.

If someone steps up to implement a robust solution and ensure it is supported, we would of course gladly welcome it.

Right.

What GPU, out of curiosity?

6x NVIDIA V100 per node.
Other information on this part of our cluster is available at
https://doc.zih.tu-dresden.de/hpc-wiki/bin/view/Compendium/Power9

So you could train on ~192 GPUs? What dataset size are you working on?

I totally understand that. We are still looking into it, but currently it looks like there are not too many changes needed, at least for a basic Horovod setup.

We are in the process of getting some of the new A100 as well.

Yes, we could. However, the question is: is it worth it? Or would we just get a speedup of, say, 8x? And another question: how does the distribution impact accuracy? That’s why we are looking into it.

Yes, this is what we can do in theory. Since it is a shared university cluster, not only for our performance group, you have to request resources, which are scheduled among other users’ jobs depending on available resources, the expected runtime of your job, … However, getting 24-36 GPUs is no big deal at the moment.
So we have to balance waiting time for resources against the time saved while running.

At the moment simply the Common Voice dataset in English, since we are at the beginning. Bigger (open) datasets are also welcome in our research.

You could try the new implementation in DeepSpeech-Polyglot. As it’s written in TensorFlow 2, you should be able to replace the distribution strategy (currently only multi-GPU) with a few lines of code and some extra setup on each node; a rough sketch is below.
But please note that the project is still very experimental and is missing some of DeepSpeech’s features.
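As a rough sketch of what I mean, with a toy Keras model instead of the real network and placeholder addresses in TF_CONFIG (so not the actual DeepSpeech-Polyglot code):

```python
# Sketch: swapping the single-node MirroredStrategy for a multi-worker one.
# Toy model and placeholder cluster config -- adapt to the real training script.
import numpy as np
import tensorflow as tf

# Current single-node, multi-GPU setup:
#   strategy = tf.distribute.MirroredStrategy()
# Multi-node setup: every worker process also needs TF_CONFIG set, e.g.
#   {"cluster": {"worker": ["node1:12345", "node2:12345"]},
#    "task": {"type": "worker", "index": 0}}
# (In older TF 2.x releases the class lives under tf.distribute.experimental.)
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Toy classifier standing in for the acoustic model.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(26,)),
        tf.keras.layers.Dense(29),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# Dummy data; the real pipeline feeds audio features and transcripts.
features = np.random.rand(256, 26).astype("float32")
labels = np.random.randint(0, 29, size=(256,))
model.fit(features, labels, batch_size=32, epochs=1)
```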


I’m digging up this thread just to announce that distributed training (using Horovod) is now merged into DeepSpeech master.
For information on how to use it, take a look at doc/TRAINING.rst.
