System memory OOM issue when training on v0.8.2

Hi all, I’ve been trying to train an acoustic model using release v0.8.2, but my process gets killed after a certain amount of time. Checking the kernel logs, it looks like an OOM issue, but at the system level, not on the GPU.

The command I’m running to invoke training is:
python -u DeepSpeech.py --train_files ../speech-transcription/training/train_data.csv --train_batch_size 1000 --n_hidden 2048 --epochs 30 --verbosity 1

This same OOM error occurs when batch size is set to 1.

Hardware specs:
OS: Ubuntu 20.04.1 LTS (Focal Fossa)
MOBO: MSI X299 Pro
CPU: Intel i9-10900X 10-core 3.7 GHz
RAM: 64 GB Corsair Vengeance 3200 MHz
GPU: Nvidia Titan RTX 24 GB

I uploaded the relevant kernel logs to pastebin: https://pastebin.com/NgS94xPQ

Has anyone experienced a similar issue when training? I have a hunch it’s an issue with my hardware, but I’m not sure. I’m considering running a training job on an AWS EC2 P3 instance to verify my training data is good, but it should be fine; it’s all 16 kHz WAV files.

This is a Linux-level OOM, not a GPU one, so it’s not surprising that batch size has no impact.

That should be enough RAM, but how much is actually free?

Adding to @lissyx, and for future reference: this is quite unusual, since OOMs mostly occur on the GPU.

Tell us more about your data: how long is the longest audio input?

@lissyx free -hw returns:
total used free shared buffers cache available
Mem: 62Gi 688Mi 14Gi 6.0Mi 268Mi 47Gi 61Gi
Swap: 2.0Gi 700Mi 1.3Gi

The 47 GiB of cache memory seems like it could be causing issues; I’ll look into clearing that.

@othiele So after collecting some stats, the longest audio sample I have is 8999 seconds. This is unintended, so I will be adjusting my extraction script to avoid it. However, that’s only 309 MiB, which doesn’t sound like it should be an issue given the memory I have.
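In case it’s useful to anyone else, a duration scan like that could look something like this minimal sketch, assuming a DeepSpeech-style CSV with a wav_filename column (adjust the path and column name to your own setup):

```python
import csv
import wave

# Training CSV from the command in the original post
CSV_PATH = "../speech-transcription/training/train_data.csv"

def wav_duration_seconds(path):
    # Read only the WAV header to get frame count and sample rate
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

longest = None
with open(CSV_PATH, newline="") as f:
    for row in csv.DictReader(f):
        duration = wav_duration_seconds(row["wav_filename"])
        if longest is None or duration > longest[1]:
            longest = (row["wav_filename"], duration)

print("Longest sample: %s (%.1f s)" % longest)
```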

So you effectively have your whole 64 GiB free; the cache is reclaimable, so it isn’t the problem.

That’s 309 MiB of raw data, which then needs to be passed through the MFCC feature pipeline. That may well require much more memory.
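To put rough numbers on that (a back-of-the-envelope sketch only; the 20 ms feature step and 26 MFCC coefficients are assumed defaults, and it assumes each batch gets padded to its longest sample):

```python
# Back-of-the-envelope only: assumed 20 ms feature step, 26 MFCC coefficients,
# float32 features, and batches padded to their longest sample.
longest_s = 8999        # longest sample reported above
batch_size = 1000       # --train_batch_size from the original command
feature_step_s = 0.02   # assumed hop between feature frames
n_mfcc = 26             # assumed number of MFCC coefficients
bytes_per_value = 4     # float32

frames = int(longest_s / feature_step_s)         # ~450,000 frames
per_sample = frames * n_mfcc * bytes_per_value   # feature bytes per padded sample
per_batch = per_sample * batch_size              # every sample padded to the longest

print("Per padded sample: %.1f MiB" % (per_sample / 2**20))  # ~44.6 MiB
print("Per padded batch:  %.1f GiB" % (per_batch / 2**30))   # ~43.6 GiB
```

Under those assumptions that is roughly 44 GiB for the padded feature tensor of a single batch, before any intermediate copies, so one very long sample combined with a batch size of 1000 can plausibly exhaust 64 GiB of system RAM.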

Understood. I’ve reduced the maximum training sample length to 120 s and cleared the cache memory with echo 3 | sudo tee /proc/sys/vm/drop_caches. I’ve just started a new training run and will report back on how it performs.

For the sake of documentation, free -hw just before beginning training now reads:
total used free shared buffers cache available
Mem: 62Gi 683Mi 61Gi 6.0Mi 5.0Mi 254Mi 61Gi
Swap: 2.0Gi 700Mi 1.3Gi

Also, make sure no other processes are interfering, especially with the GPU.

Most people train with around 8000 chunks of 10 seconds each, so it’s no wonder you get an OOM. You probably have fewer than 1000 files, so use a smaller batch size of 32 or 64.
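For example, starting from the command in the original post, that would simply be:
python -u DeepSpeech.py --train_files ../speech-transcription/training/train_data.csv --train_batch_size 64 --n_hidden 2048 --epochs 30 --verbosity 1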

Training just got killed again, after 1:24:18. After the process was killed, there is still 61 GiB of free memory. Here are the latest OOM kernel logs: https://pastebin.com/zRf6bAXf

My average sample length is 18 seconds with a max of 120. I’ve got just under 300,000 speech samples to train with.

I’m going to cut down the size of my training set and see if I can complete at least 1 epoch.

You’re still running GNOME in parallel. How can you be sure it isn’t eating away your RAM?

Well, after the process is killed, it’s not unexpected that you can reclaim all the memory. Besides, 61 GiB free? What about cache? How does the memory usage grow over time?

300,000 files in 300 MB with an average length of 18 seconds doesn’t sound realistic.

What type of audio (ffmpeg -i)?

Sorry, what I meant is that my largest file is 300 MB. The total size of the training set is much larger: 161.18 GiB. It’s all 16-bit, 16,000 Hz WAV files.

OK, you’ll probably need to use shorter chunks. Try a run with chunks no longer than 10 seconds; that shouldn’t be a problem. If that still fails, you have a different problem, but most people use chunks of up to 15 seconds.
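If it helps, here’s a minimal sketch for filtering a DeepSpeech-style CSV down to a maximum chunk length, assuming the usual wav_filename / wav_filesize / transcript columns (16-bit, 16 kHz mono is 32,000 bytes per second, so wav_filesize gives the duration without opening each file):

```python
import csv

SRC = "train_data.csv"          # hypothetical input CSV
DST = "train_data_max10s.csv"   # hypothetical filtered output
MAX_SECONDS = 10
BYTES_PER_SECOND = 16000 * 2    # 16 kHz, 16-bit, mono

with open(SRC, newline="") as fin, open(DST, "w", newline="") as fout:
    reader = csv.DictReader(fin)
    writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # Approximate duration from the file size (ignores the ~44-byte WAV header)
        duration = int(row["wav_filesize"]) / BYTES_PER_SECOND
        if duration <= MAX_SECONDS:
            writer.writerow(row)
```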

I’ve disabled the window manager entirely and get the same error: https://pastebin.com/yHMrfeSv

I’ve built a script to track memory usage alongside training. I’m going to run it overnight with a new training set I’m currently extracting, with no samples longer than 60 seconds.
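A minimal version of such a logger (polling /proc/meminfo every few seconds and appending rows to a CSV) could look like this:

```python
import csv
import time

LOG_PATH = "memory_usage.csv"   # hypothetical output file
INTERVAL_S = 5                  # sampling interval in seconds

def meminfo_kib():
    # Parse /proc/meminfo into a dict of {field: value in KiB}
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.strip().split()[0])
    return info

with open(LOG_PATH, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "mem_total_kib", "mem_available_kib", "cached_kib", "swap_free_kib"])
    while True:  # stop with Ctrl-C once training finishes
        m = meminfo_kib()
        writer.writerow([int(time.time()), m["MemTotal"], m["MemAvailable"], m["Cached"], m["SwapFree"]])
        f.flush()
        time.sleep(INTERVAL_S)
```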

Sorry about the delay, here’s my memory usage over time in my training run:

I’m no expert, but in general it looks like the memory usage is not static, which seems problematic.

I’m going to try a training run with an open source set (WSJ probably) and see if results are any different. I may also attempt a run on an AWS instance to A/B test potential hardware issues.

Just to clarify, is that CPU memory? And please describe the system and the length of the data chunks you used.

Correct, this is system memory obtained by calling free -w. The system hardware info is in the original post. The data chunks are 12 seconds long on average (max 60 s), 16 kHz 16-bit audio, and I used a batch size of 128.

This shouldn’t happen, but not many people train with files of such high variance in length. It might be some strange memory leak. As I suggested earlier, I would train with chunks of at most 12–15 seconds in length and see whether this happens again.

And is this just one epoch, or what are we looking at?

So I’ve been able to successfully complete a training run with the 100-hour LibriSpeech set, which means the issue must be with the data I’m using. Here’s the memory usage over time during the completed run (this is about 18 epochs; I stopped memory collection mid-epoch).

Memory usage does seem to creep up over each epoch. Would this indicate a memory leak? Shouldn’t memory usage be fairly static between batches?

I did notice during my testing that the process got killed after a certain number of samples were processed in a single epoch, independent of batch size. For example, a batch size of 128 would be killed around step 255 within the epoch, while a batch size of 1 would be killed around step 32,640 (255 × 128).

The runs shown in the first graphs were killed within the first epoch.