Vocoder (dev) - potential memory usage issue in training?

Whilst training the integrated vocoder from dev, I ran into a problem very late last night where the computer eventually ran out of memory and the task was killed - I've got 32 GB RAM on the machine, so it was a bit of a surprise!

It had reached a global step just past 283062. There was no error message; it simply stopped with "Killed" (which came from the OS, I believe). RAM usage remained high and there were no other significant tasks running.

I’ll need to investigate more this evening, but I did see this message in the logs, from early on:

/home/neil/main/Projects/TTSJul2020/TTS20Jul2020dev/TTS/tts/utils/visual.py:37: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`).
>   fig = plt.figure(figsize=fig_size)

It appeared right after one of the early evaluations. It might be a red herring, but given that it mentions memory, perhaps it's connected.
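For context, here's a minimal sketch (nothing to do with the TTS code itself) of what triggers that warning: pyplot keeps a global registry of every figure created through plt.figure() and complains once more than figure.max_open_warning (20 by default) are open at the same time, so anything that plots repeatedly without closing will keep those figures in RAM.

```python
# Minimal sketch (not the TTS code): pyplot tracks every figure created via
# plt.figure() until it is explicitly closed, so a loop that plots each
# evaluation without closing will eventually trip this warning and keep all
# of those figures in RAM.
import matplotlib
matplotlib.use("Agg")  # headless backend, as on a training machine
import matplotlib.pyplot as plt

for step in range(25):
    fig = plt.figure(figsize=(8, 4))
    # ... draw a spectrogram / alignment here ...
    plt.close(fig)  # without this, the 21st figure triggers the RuntimeWarning above

print(len(plt.get_fignums()))  # 0 -> nothing left open
```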

I saw this issue but it doesn't appear to involve the vocoder (the output suggests a regular TTS model training run).

Q. Has anyone else run into issues with memory usage whilst training the vocoder?

We just ran a vocoder up to 200k with 16 GB and didn't have that problem. We do have an exploding generator… but not with the new integrated vocoder, sorry.

Hm it seems like the figures which are created during the tensorboard logging routine aren’t closing correctly? If you follow the function calls, the figures should close once they are written to tensorboard.

No problems so far.

Thank you both!

I had three stages:

  1. Initial training to 200k - getting better over time and approaching quite reasonable quality (although still plenty of room to improve)
  2. From 200k something must have kicked in, as the results suddenly got worse: the charts show it clearly and the voice quality seemed to be getting worse overall (but it wasn't totally clear cut)
  3. Then the memory issue caused it to stop at ~276k, and continuing training from that point resulted in a huge jump (partly expected after restarting), but the voice quality became complete gibberish - I was working, so I let it run on, but there was no discernible improvement in the audio, so I stopped it again just now.

@sanjaesc as you suggested, following through the code for the figures, I see that there are plots created (here and here) which are never closed - the points shown are both followed by the code that writes them to tensorboard, so a close right after that should work as you indicated. In the latest code that still seems to be the case, but with the refactor the locations are here and here. I'll see about creating a PR for it at the weekend.
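To make it concrete, this is the sort of pattern I have in mind - the function and tag names below are placeholders rather than the actual TTS code:

```python
# Placeholder sketch of "close right after the TensorBoard write" - names are
# illustrative, not the real TTS logging code.
import matplotlib.pyplot as plt


def log_figures(tb_writer, global_step, figures):
    """figures: dict mapping a name to a matplotlib Figure, as returned by the plot helpers."""
    for name, fig in figures.items():
        # tensorboardX SummaryWriter.add_figure; close=False so the explicit
        # close below is what frees the figure
        tb_writer.add_figure(f"train/{name}", fig, global_step, close=False)
        plt.close(fig)  # the close that the two linked locations are currently missing
```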

As for the jump at 200k, I guess I'll need to look at some of the settings - I'm thinking it might be the start of the discriminator, which is set to begin at 200k, as per the config here: https://github.com/mozilla/TTS/blob/dev/TTS/vocoder/configs/multiband_melgan_config.json#L83

Maybe a good place to start is comparing the discriminator here against the one in the MelGAN implementation in Eren's fork of PWGAN here: https://github.com/erogol/ParallelWaveGAN/blob/tts/parallel_wavegan/melgan_v4.yml#L124, as it seems to start earlier (100k) and use a lower learning rate (5.0e-5 vs 1e-4 in the integrated version). Anyway, I think that may also need to wait till the weekend, when I'll have more time.
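To keep the two side by side, here are the values being compared; the key names are my guesses from the linked files, so treat them as assumptions:

```python
# Rough comparison only - values as quoted above; key names are assumed from the
# linked config files and may not match exactly.
integrated_vocoder = {
    "steps_to_start_discriminator": 200000,  # dev multiband_melgan_config.json
    "lr_disc": 1e-4,                         # assumed key for the discriminator LR
}
pwgan_fork_melgan_v4 = {
    "discriminator_train_start_steps": 100000,  # assumed key in melgan_v4.yml
    "discriminator_lr": 5.0e-5,                 # assumed key; value from the yml
}
```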

If it's handy for anyone, here are my tensorboard charts. The correspondence with the three stages I mentioned above is fairly clearly visible in some of the charts (especially the fifth one from the left, top row).

Yes, the function will create and return a dictionary of figures. Those figures are then passed through to tb_train_figures. Following it from here the figures should be closed at some point by a call to “figure_to_image” in tensorboardX/utils.py. Sorry if I am misunderstanding you, but shouldn’t the same happen in the refactor? :sweat_smile:

Ah, I see now. Sorry, I had been looking for the TTS code to close them, but you're right - TensorboardX defaults to closing the figures itself, so this shouldn't be causing a problem (both before and after the refactor).
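For anyone following along, this is the quick standalone check I used to convince myself (assuming tensorboardX is installed - it's the underlying call, not the TTS logging code):

```python
# Quick standalone check (not the TTS code): tensorboardX's add_figure() defaults
# to close=True, and the closing happens inside figure_to_image in
# tensorboardX/utils.py as mentioned above.
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from tensorboardX import SummaryWriter

writer = SummaryWriter("/tmp/tb_figure_check")
fig = plt.figure(figsize=(4, 3))
plt.plot([0, 1, 2], [0, 1, 4])
writer.add_figure("check/figure", fig, global_step=0)  # close=True by default

print(plt.get_fignums())  # [] -> the figure was closed for us
writer.close()
```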

I will need to look further later on. I wonder if they aren't being removed because a reference is being kept somewhere. There are a number of Google results for this warning, so I should be able to get to the bottom of it.

It might be that the vocoder data loader keeps data in memory if you set use_cache: true, and thus it might cause the OOM.

I tried a quick solution last night: going through each of the figures in the dictionary, closing them explicitly, and setting the dictionary to None after they're used, but I still see the message.

@erogol thanks for mentioning the cache setting, I'll take a look there too. It could well be something like that behind the memory issue, with the figures warning just a distraction. Nearly the weekend, so I'll be able to look in more detail then :slightly_smiling_face:

@erogol - I have tried it with the cache setting turned off and that has kept the memory usage at a reasonable level so far. It has only been running since lunch, so I need to see how it holds up, but I think this is helping. Thank you!
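For reference, this is the setting as I now have it - shown as a fragment only, with every other key left as in the repo config:

```python
# Fragment only: the cache switch @erogol mentioned, as I now have it. With it
# enabled the data loader keeps the loaded data in RAM (growing over the run);
# disabled, it is reloaded/recomputed instead.
vocoder_config_fragment = {
    "use_cache": False,  # was true during the run that hit the OOM
}
```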

Can I ask about the comment below from your blog? Was that in relation to the integrated vocoder? (I guess it was, but I wasn't 100% sure.)

I trained MB-Melgan vocoder using real spectrograms up to 1.5M steps, which took 10 days on a single GPU machine. For the first 600K iterations, it is pre-trained with only the supervised loss as in [11] and then the discriminator is enabled for the rest of the training. I do not apply any learning rate schedule and I used 1e-4 for the whole training.

I actually kicked off the current run with steps_to_start_discriminator set to 200k, as per the config in the repo, but I'll try to stop it after a checkpoint and continue training with it set to 600k - switching that before it reaches 200k should be okay, shouldn't it? (i.e. it won't be incompatible or stop it from restarting properly)
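In case it's useful, this is roughly how I plan to make the switch; the config path and the load_config import are my assumptions and may have moved in the refactor:

```python
# Rough plan for the switch - the config path and the load_config import are
# assumptions and may have changed with the refactor.
import json

from TTS.utils.io import load_config  # the repo's JSON-with-comments loader

cfg = load_config("TTS/vocoder/configs/multiband_melgan_config.json")
cfg["steps_to_start_discriminator"] = 600000  # was 200000 in the dev config

# save a copy and point the restarted run at this file instead
with open("multiband_melgan_config_600k.json", "w") as f:
    json.dump(cfg, f, indent=4)
```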

Yes, you can switch to 600k before reaching 200k. The best way is to check the validation loss, and when it converges you can enable the discriminator.

By the way, I'd found that even after disabling the cache setting I did in time see the memory creep up again (I hadn't waited long enough when I checked it for my comment above).

This issue mentioned a similar memory problem and a fix suggested by Dan worked for me.

When I'd tried fixing it using the same principle, I'd put the close in too late and it had failed, whereas Dan's approach completely stops the memory usage problem and doesn't impact any of the charting.

Where exactly do you put the .close() in visual.py?

I put it in the locations shown in Dan’s change:

In my attempts before, I'd tried to put the closes after the last point where I thought the figures were being used, but that didn't have any effect. Where they sit in Dan's code did surprise me, as I thought it would affect returning the figures (which happens after the plt.close), but it hasn't appeared to cause any problems with the charts in Tensorboard.
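Having poked at it a bit, I think the reason returning after plt.close still works is that closing only drops pyplot's global reference; the Figure object itself survives and can still be drawn onto an Agg canvas, which I believe is roughly what tensorboardX does when converting it to an image. A small standalone check (not the TTS code):

```python
# Standalone check (not the TTS code): a figure closed with plt.close() can still
# be returned and rendered afterwards, because only pyplot's registry entry is
# dropped, not the Figure object itself.
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from matplotlib.backends.backend_agg import FigureCanvasAgg


def make_plot():
    fig = plt.figure(figsize=(4, 3))
    plt.plot([0, 1, 2], [0, 1, 4])
    plt.close(fig)   # close before returning, as in Dan's change
    return fig       # the caller can still use the figure


fig = make_plot()
print(plt.get_fignums())     # [] -> pyplot is no longer tracking it
FigureCanvasAgg(fig).draw()  # rendering still works on the "closed" figure
```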

The third change in that code isn't likely to impact memory use during training, as I believe that final bit of code is only used in one of the notebooks, but it does look like it's fixing an indentation issue there.

I think if you wanted a less spooky change it would mean not using pyplot and doing everything with the object-oriented approach to matplotlib. Unfortunately every time I have to use matplotlib I’m basically learning it anew so I went for the quick fix :slightly_smiling_face:

This is the point at which I introduced the change from Dan's code (blue arrow). I don't have a compiled table of the memory usage measurements, but it correlated closely with the load time going up. Each of the drops was when I periodically stopped the training (which freed memory) and then restarted. Due to chart smoothing and the time it takes, the impact took a while to become obvious.