Training hardware bottlenecks

Hi!

I’m doing training experiments on a couple of machines. Both machines have a single RTX 2080 Ti in them but different RAM and CPU setups. The thing I’m seeing is that I’m very CPU limited. One setup has a 4-thread i5 and the other an 8-thread i7. Both CPUs seem to struggle to keep the RTX card occupied: the cards are never 100% loaded and sometimes just sit idle.

My question is: how much would I need to bump the CPU to max out a single RTX 2080 Ti? Does the training benefit more from many cores or from high clock speeds?

Best regards

Do you use a phoneme-based model?

I’m currently running experiments with and without phonemes. It’s a lot faster without, but the CPU is still maxed out while taxing the 2080 Ti no more than 30%.

That is normal if it is only the first epoch, since it is caching phonemes for the coming epochs. If it is still slow after the first epoch, then there might be something wrong.

The phonemes experiment is well past the first epoch, and I did notice an increase in GPU usage after the first epoch. But even now the i7 won’t load the 2080 Ti beyond 30%. Would you have any suggestions on how to debug this?

I already print loader time and step time; you can check these values for any exceptional delay. Otherwise, it might be about your batch_size. You can try changing it and see the effect on GPU utilization. Using too many loader processes might also slow things down.
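For anyone who wants to reproduce that check outside this codebase, here is a minimal sketch of separating data-loading time from GPU step time in a generic PyTorch loop. `model`, `loader`, `criterion`, and `optimizer` are placeholders for your own objects, not the project's; if the loader time dominates, the CPU-side preprocessing is the bottleneck.

```python
import time
import torch

def timed_epoch(model, loader, criterion, optimizer, device="cuda"):
    end = time.time()
    for inputs, targets in loader:          # assumes the loader yields (inputs, targets)
        load_time = time.time() - end       # time spent waiting on the DataLoader

        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()            # make GPU work visible to the wall clock
        step_time = time.time() - end - load_time

        print(f"loader: {load_time:.3f}s  step: {step_time:.3f}s")
        end = time.time()
```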

I’m definitely hitting this same issue. I have an RTX 2080 Ti and an 8-core, 16-thread i9-9900K. GPU utilisation bounces between 30% and 65%, with the CPU pegged at 100%. This degree of CPU bottlenecking doesn’t seem plausible to me.

I’ve tried turning off the phoneme thing to see if that at least stops the bottlenecking but it doesn’t seem to make a difference, even after the first epoch + validation is over.

If I do manage to come up with a solution, I’ll post it here.

Can you give the specs of the machine you usually run the model on, @erogol? If the specs are much better than mine, that might explain it.

I’ve spent an afternoon tweaking the config (esp. the batch size, and presence/absence of encoding + stopnet) to no avail. It seems the preprocessing during training is very CPU heavy.

Have you checked how many processes your dataloader uses? You can also try running your model with OMP_NUM_THREADS=1 python train.py ...
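(As an aside, and only as a sketch under the assumption that PyTorch's own CPU ops are what is oversubscribing the cores, the thread count can also be capped from inside the script. The environment variable is still the safer choice because it also reins in other OpenMP users such as NumPy/librosa inside the DataLoader workers.)

```python
import torch

# Cap PyTorch's CPU thread pools; roughly what OMP_NUM_THREADS=1 does for
# torch's own ops.
torch.set_num_threads(1)          # intra-op parallelism (OpenMP/MKL)
torch.set_num_interop_threads(1)  # inter-op thread pool
```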


I can try that tomorrow, hopefully it helps.

I tried setting the number of data loader processes to 8 with the config file (I didn’t go higher because the comments suggested 4-8 with 4 as the default).

At the moment all 16 threads are maxed out during training so it seems odd to want to change the number of threads but it’s worth a shot. Thanks

Setting OMP_NUM_THREADS=1 ended up working for me. I’m using Docker, so I set the variable in the Dockerfile, but otherwise I would have set it like any other environment variable.

Now the model is running at 80% GPU load and I’m trying to squeeze out the last 15-20% of performance, but I consider this problem solved.


Hi - I have similar issues. I use an RTX 2070 (8 GB of memory) and a (rather aged) quad-core Xeon E3-1230 V2 (3.3 GHz). The CPU sits at 50-60% utilization during training, using 4-6 threads.

This is with Tacotron 1 and no neural vocoder (most config settings, incl. batch size, are at their defaults).

However, my GPU is rather bored most of the time… 0-35% utilization. Without knowing the source code in any detail: is this expected, or where is the bottleneck on my side?

This thread really helped me!! Setting OMP_NUM_THREADS worked for me too; my training feels much faster now. Did you make any progress on the remaining 15-20% that you could share?

Hey, can anyone give advice here, please? The OMP_NUM_THREADS trick only seems related to multi-process/GPU training… training as such is horribly slow, so some well-documented tips & tricks here would do miracles - for everyone. Thx.

One guess: if you have an HDD (not an SSD), data loading might be the bottleneck.

But we all know how to code, I assume. Be diligent in debugging the code and see where the bottleneck is in your case.

One easy fix might be to write a new dataloader which ingests pre-computed spectrograms instead of computing them on the fly (if the bottleneck is the HDD or the spectrogram computation).
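A rough sketch of that idea (not code from the project; the `items` list and file paths are placeholders): precompute the mel spectrograms once, save them as .npy files, and let the Dataset simply load them so the workers do no signal processing during training.

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class PrecomputedMelDataset(Dataset):
    """Serves mel spectrograms that were computed and saved offline."""

    def __init__(self, items):
        # items: list of (text, mel_path) pairs prepared by an offline script
        self.items = items

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        text, mel_path = self.items[idx]
        mel = np.load(mel_path)             # cheap disk read, no STFT on the fly
        return text, torch.from_numpy(mel)
```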

Hi - thanks for sharing your insights. Data is only loaded at the beginning (yes, it is an HDD); after that it seems to be cached in memory, with not much IO going on… I haven’t benchmarked it yet. I rather thought there were some general misconceptions or mistakes in how I’m using the code. I can try to benchmark it and share any new insights.

So in your opinion, using (more or less) the full capacity of the GPU should be possible?

Oh yeah, there is also the phoneme computation in the first epoch, if phoneme use is enabled. That should go away after the first epoch.

I think the difference is that my dataset is smaller than LJSpeech, so the time spent logging things to tensorboard is a greater proportion of the training time (since there are more epochs per cycle with less data).

So I added some config variables to reduce how often it writes to the tensorboard logs. This helps for sure, but it didn’t get me a large speedup.
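The gist of it, as an illustration only (the interval constant below is made up, not one of the project's config keys): gate the tensorboard writes on a step interval so logging stops eating a fixed chunk of every short epoch.

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter("logs")
LOG_EVERY_N_STEPS = 50  # hypothetical knob, tune to taste

def maybe_log(global_step, loss):
    # Only touch tensorboard every N steps instead of on every step.
    if global_step % LOG_EVERY_N_STEPS == 0:
        writer.add_scalar("train/loss", loss, global_step)
```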

I do still have some questions about the way multiprocessing is implemented in distribute.py in particular. I think it spins up too many worker threads even if you have OMP_NUM_THREADS=1. In my test, the epoch time was shorter overall (with much lower CPU utilisation) once I set "num_loader_workers" and "num_val_loader_workers" both to 1.
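For reference, those config keys correspond to the num_workers argument of PyTorch's DataLoader; the snippet below is a generic sketch with dummy datasets standing in for the project's own classes.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy datasets just so the snippet runs; swap in the real ones.
train_dataset = TensorDataset(torch.randn(256, 80), torch.randn(256, 1))
val_dataset = TensorDataset(torch.randn(64, 80), torch.randn(64, 1))

# "num_loader_workers" / "num_val_loader_workers" map to num_workers here.
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True,
                          num_workers=1, pin_memory=True)
val_loader = DataLoader(val_dataset, batch_size=32,
                        num_workers=1, pin_memory=True)
```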