Hi all, I’ve been trying to train an acoustic model using release v0.8.2, but my process gets killed after a certain amount of time. Checking the kernel logs, it seems to be an OOM issue, but at the system level, not on the GPU.
The command I’m running to invoke training is:

python -u DeepSpeech.py --train_files ../speech-transcription/training/train_data.csv --train_batch_size 1000 --n_hidden 2048 --epochs 30 --verbosity 1
This same OOM error occurs when batch size is set to 1.
Has anyone experienced a similar issue when training? I have a hunch it’s an issue with my hardware, but I’m not sure. I’m considering executing a training run on an AWS EC2 P3 instance to verify my training data is good, but it should be; it’s all 16 kHz WAV files.
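For reference, the check I’m planning to run over the training CSV is roughly this (a quick sketch, assuming the standard wav_filename column and that every file is plain PCM WAV):

import csv, wave

with open("../speech-transcription/training/train_data.csv") as f:
    for row in csv.DictReader(f):
        # assumes the standard DeepSpeech CSV layout with a wav_filename column
        with wave.open(row["wav_filename"]) as wav:
            if (wav.getframerate() != 16000
                    or wav.getnchannels() != 1
                    or wav.getsampwidth() != 2):
                print("unexpected format:", row["wav_filename"],
                      wav.getframerate(), wav.getnchannels(), wav.getsampwidth())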
lissyx
This is a Linux-level OOM, not a GPU one, so it’s not surprising that batch size has no impact.
@lissyx free -hw returns:

              total        used        free      shared     buffers       cache   available
Mem:           62Gi       688Mi        14Gi       6.0Mi       268Mi        47Gi        61Gi
Swap:         2.0Gi       700Mi       1.3Gi
The 47 GiB of cache memory seems like it could be causing issues; I’ll look into clearing that.
@othiele So after collecting some stats, the longest audio sample I have is 8999 seconds. This is unintended, so I will be adjusting my extraction script to avoid this. However, that’s only 309 MiB, which doesn’t sound like it would be an issue with the memory I have.
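For context, the stats came from a quick pass over the CSV, roughly like this (a sketch that estimates duration from the wav_filesize column, assuming 16 kHz 16-bit mono with a 44-byte header):

import csv

BYTES_PER_SECOND = 16000 * 2  # 16 kHz, 16-bit mono PCM

with open("../speech-transcription/training/train_data.csv") as f:
    durations = [(int(row["wav_filesize"]) - 44) / BYTES_PER_SECOND
                 for row in csv.DictReader(f)]
print("samples:", len(durations))
print("max:", max(durations), "s  mean:", sum(durations) / len(durations), "s")
print("total:", sum(durations) / 3600, "hours")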
lissyx
So you only have 64 GiB of RAM in total.
That’s 309 MiB of raw data, which then needs to be passed through the MFCC pipeline; that might well require much more memory.
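As a rough back-of-envelope (assuming the default feature settings of 26 MFCCs, a 20 ms step and 9 frames of context on each side; I have not re-checked those here):

# Per second of audio, very roughly (assumed defaults, not measured):
raw = 16000 * 2                    # ~32 KB of 16-bit PCM
frames = 1000 // 20                # ~50 feature frames at a 20 ms step
features = frames * 19 * 26 * 4    # 19-frame context windows of 26 float32 MFCCs, ~99 KB
print(features / raw)              # ~3x the raw size, before activations and gradients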
Understood. I’ve reduced the maximum training sample length to 120 s and cleared my cache memory with echo 3 | sudo tee /proc/sys/vm/drop_caches. I’ve just started a new training run and will report back on how it performs.
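The length cap itself is just a filter over the CSV, roughly like this (a sketch, again assuming 16 kHz 16-bit mono and the standard wav_filesize column):

import csv

MAX_SECONDS = 120
BYTES_PER_SECOND = 16000 * 2  # 16 kHz, 16-bit mono PCM

with open("train_data.csv") as src, open("train_data_120s.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        # drop any sample whose estimated duration exceeds the cap
        if int(row["wav_filesize"]) / BYTES_PER_SECOND <= MAX_SECONDS:
            writer.writerow(row)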
For the sake of documentation, free -hw just before beginning training now reads:

              total        used        free      shared     buffers       cache   available
Mem:           62Gi       683Mi        61Gi       6.0Mi       5.0Mi       254Mi        61Gi
Swap:         2.0Gi       700Mi       1.3Gi
lissyx
Also, make sure no other process is interfering, especially with the GPU.
Most people train with something like 8,000 chunks of 10 seconds each; with samples as long as yours, it’s no wonder you get an OOM. And you probably have fewer than 1,000 files, so use a smaller batch size like 32 or 64.
My average sample length is 18 seconds with a max of 120. I’ve got just under 300,000 speech samples to train with.
I’m going to cut down the size of my training set and see if I can complete at least 1 epoch.
lissyx
You’re still running GNOME in parallel. How can you be sure it isn’t eating away at your RAM?
Well, after the process is killed, it’s not unexpected that you can reclaim all of the memory. Besides, 61 GiB free? What about cache? How does the memory usage grow?
OK, you’ll probably need to use shorter chunks. Try a run with chunks no longer than 10 seconds; that shouldn’t be a problem. If that still fails, you have a different problem, but most people use chunks of at most 15 seconds.
I’ve built a script to track memory usage alongside training. I’m going to run it overnight with a new training set I’m currently extracting, with no samples longer than 60 seconds.
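The tracking script is nothing fancy, roughly this (a sketch that samples MemAvailable from /proc/meminfo every 30 seconds and appends it to a CSV):

import time

def mem_available_kib():
    # parse MemAvailable out of /proc/meminfo (Linux only)
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                return int(line.split()[1])

with open("mem_log.csv", "w") as log:
    log.write("timestamp,mem_available_kib\n")
    while True:
        log.write(f"{time.time():.0f},{mem_available_kib()}\n")
        log.flush()
        time.sleep(30)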
I’m no expert, but in general it looks like the memory usage is not static, which seems problematic.
I’m going to try a training run with an open source set (WSJ probably) and see if results are any different. I may also attempt a run on an AWS instance to A/B test potential hardware issues.
Correct, this is system memory obtained by calling free -w. The system hardware info is in the original post. The data chunks average 12 seconds (max 60 s) of 16 kHz 16-bit audio, and I used a batch size of 128.
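If I assume each batch gets padded to the length of its longest sample (an assumption about the feeding code I haven’t verified), the worst case per batch works out to roughly:

batch, max_seconds = 128, 60
raw = batch * max_seconds * 16000 * 2              # 16-bit PCM: ~234 MiB
features = batch * max_seconds * 50 * 19 * 26 * 4  # windowed float32 MFCCs (assumed defaults): ~723 MiB
print(raw / 2**20, features / 2**20)

That’s before any activations or gradients, so it would only be a lower bound on what the pipeline actually holds.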
This shouldn’t happen, but not many people train with files of such high variance in length, so it might be some strange memory leak. As I suggested earlier, I would train with chunks of at most 12-15 seconds in length and see whether this happens again.
So I’ve been able to successfully complete a training run with the 100-hour LibriSpeech set, which means the issue must be with the data I’m using. Here’s the memory usage over time during the completed run (this is about 18 epochs; I stopped memory collection mid-epoch).
Memory usage does seem to creep up over each epoch. Would this indicate a memory leak? Shouldn’t there be pretty static memory usage between batches?
I did notice during my testing that the process got killed after a certain number of samples were processed within a single epoch, independent of batch size. For example, with a batch size of 128 the run would be killed around step 255 within the epoch, while with a batch size of 1 it would be killed around step 32,640 (255 * 128).
The runs shown in the first graphs were killed within the first epoch.