Error when loading a sequence of models in Python

I’m trying to create a Python script that plots a learning curve showing how the model’s accuracy changes as the amount of training data grows, but it ends with an error.

Model training is started as an external subprocess and that part works fine.
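For context, the training call is roughly along the lines of this simplified sketch (the train_model signature here is trimmed down and the DeepSpeech.py flag names are assumptions from memory that may differ between versions):

import subprocess

def train_model(checkpoint_dir, export_dir, epoch_number, train_list_path,
                validation_list_path, test_list_path, learning_rate,
                deepspeech_directory):
    # Run one round of training in its own process; all the GPU memory it
    # grabs is returned to the driver when that process exits.
    cmd = [
        "python", "-u", "DeepSpeech.py",
        "--train_files", train_list_path,
        "--dev_files", validation_list_path,
        "--test_files", test_list_path,
        "--checkpoint_dir", checkpoint_dir,
        "--export_dir", export_dir,
        "--epoch", str(epoch_number),
        "--learning_rate", str(learning_rate),
    ]
    # check=True turns a failed training round into an exception
    subprocess.run(cmd, cwd=deepspeech_directory, check=True)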

I’ve installed deepspeech-gpu==0.2.1a1 for Python 3 with pip, taken client.py and modified it to load the newly generated model and run inference on the test data, so that the model, language model etc. are loaded just once per evaluation.

The first model is evaluated fine; the problem starts once the second model is trained and the script tries to load it for inference:

ResourceExhaustedError (see above for traceback): OOM when allocating tensor of shape [2048,2048] and type float
         [[{{node h2/Adam/Initializer/zeros}} = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [2048,2048] values: [0 0 0...]...>, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

A simplified version of the script looks like this:

from deepspeech import Model

for train_duration in train_durations:
    train_list_path = create_test_list(training_directory, train_duration)
    # train a new model in an external subprocess
    trained_model = train_model(original_model, checkpoint_dir, export_dir, epoch_number, train_list_path, validation_list_path, test_list_path, learning_rate, deepspeech_directory)
    # load the newly exported model and evaluate it on the train and test lists
    ds = Model(model, N_FEATURES, N_CONTEXT, alphabet, BEAM_WIDTH)
    ds.enableDecoderWithLM(alphabet, lm, trie, LM_WEIGHT, VALID_WORD_COUNT_WEIGHT)
    train_score = infer_audio_list(train_list_path, ds)
    test_score = infer_audio_list(test_list_path, ds)
    # attempt to release the model (and its GPU memory) before the next round
    del ds
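For reference, infer_audio_list is roughly a helper like this simplified sketch (it assumes the list file contains one WAV path per line and uses the 0.2.x stt(audio, sample_rate) API; the real helper also scores the transcripts against the references):

import wave

import numpy as np

def infer_audio_list(list_path, ds):
    # Run inference on every WAV file listed in list_path with an already
    # loaded Model instance and return the transcripts.
    with open(list_path) as fin:
        wav_paths = [line.strip() for line in fin if line.strip()]
    transcripts = []
    for wav_path in wav_paths:
        with wave.open(wav_path, "rb") as wav:
            fs = wav.getframerate()
            audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)
        # deepspeech 0.2.x stt() takes the 16-bit PCM samples plus the sample rate
        transcripts.append(ds.stt(audio, fs))
    return transcripts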

It’s most probably related to the previous model not being released, but I haven’t found a way to release it from code.

Does anyone have an idea how to fix this?

The OOM indicates a GPU-side memory issue. What’s your GPU? Batch size? Dataset?

  • GPU is a Tesla M60 with 7.44 GiB of memory

  • fine-tuning from the default frozen model output_graph.pb with a very small dataset (about 10-200 files, 5 seconds each)

  • the batch size used is the default one (1, from what I can see in DeepSpeech.py)

  • the issue is related to GPU memory in the following way (observed with nvidia-smi):

  1. the main script starts; no GPU memory is used
  2. the first round of training starts; the training subprocess gets all GPU memory
  3. the training subprocess ends; GPU memory is released
  4. the main script starts running inference and gets all GPU memory
  5. inference ends, but the GPU memory is still allocated to the main script
  6. the next training subprocess starts, but there’s no GPU memory left, as the main process is still holding it

A workaround would probably be to run the inference in its own subprocess; when that finishes, the GPU memory would be freed for the second round of training.

My question is whether there’s a way to release the GPU memory allocated by the main script after the first inference ends. The “del ds” part is an attempt at doing so, but it didn’t help.

Well, we don’t directly control it; it’s up to TensorFlow / CUDA to deal with that.

That looks like a strange and slow way. Why not:

from deepspeech import Model

ds = Model(model, N_FEATURES, N_CONTEXT, alphabet, BEAM_WIDTH)
ds.enableDecoderWithLM(alphabet, lm, trie, LM_WEIGHT, VALID_WORD_COUNT_WEIGHT)

for train_duration in train_durations:
    train_list_path = create_test_list(training_directory, train_duration)
    trained_model = train_model(original_model, checkpoint_dir, export_dir, epoch_number, train_list_path, validation_list_path, test_list_path, learning_rate, deepspeech_directory)
    train_score = infer_audio_list(train_list_path, ds)
    test_score = infer_audio_list(test_list_path, ds)

?

Basically, each time you call Model() it would reload everything, which is slow.
I’m still surprised about step 5. Can you reproduce that behavior with a smaller dataset? And maybe share a ready-to-use reproducer?

I need to run the inference on the newly trained model, not on the original one.

Ok, that makes sense then. Can you check without using the language model?

@yv001 Could you try forcing a garbage collection after your del ds statement?

import gc

for [...]:
    [...]
    del ds
    gc.collect()

https://github.com/tensorflow/tensorflow/issues/20387

Checking that issue, @yv001, it seems very close. In native_client/python/__init__.py you can see that upon __del__() we do call DS_DestroyModel(). In deepspeech.cc you can check that when we reach this, we destroy the ModelState, which, in turn, is in charge of closing the TensorFlow session and releasing the objects we allocated.

If closing the session is not enough to release the memory… :frowning:

Just for the record, disabling the LM did not help.

I’m seeing people tackle this (what looks, so far, to be a) TensorFlow limitation by forcing a cudaDeviceReset() call. Looks like you might want to try that; I’ve seen it in the documentation of the numba Python module.

This should free resources allocated in the context of the process (here, the Python process) and thus it should be safe.

@lissyx I think you’re right that the issue is related to the TensorFlow limitation.

  • just running gc.collect() does not help

  • I’ve tried the numba close() approach described in the TensorFlow thread, and the second training ran OK; however, the second model initialization fails with:

    E tensorflow/stream_executor/cuda/cuda_driver.cc:655] failed to memset memory: CUDA_ERROR_INVALID_VALUE: invalid argument
    Failed precondition: Failed to memcopy into scratch buffer for device 0

But that’s probably expected, as it’s described as a drawback a few messages later in the same TensorFlow thread.

So to sum up, this does not help either for repeated inferences:

from numba import cuda

for [...]:
    [...]
    del ds
    cuda.select_device(0)
    cuda.close()

Even though having the option of running several models (sequentially, or even better in parallel) in one Python script would be more elegant, I’ll resort to subprocesses for now.

I’m going to try refactoring the main script to run the inference as a separate subprocess, collect the inference results in a file, and see how it goes.
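One variant of that idea, as a minimal untested sketch, uses multiprocessing with the spawn start method instead of a shell subprocess, and a queue instead of a result file: the child process owns all the CUDA state, so the GPU memory is released when it exits. It assumes the constants and infer_audio_list above are defined at module level and that the main loop is guarded by if __name__ == "__main__":

import multiprocessing as mp

def _infer_worker(model_path, list_path, queue):
    # Everything CUDA-related happens in this child process, so the GPU
    # memory TensorFlow allocates is given back when the process exits.
    from deepspeech import Model
    ds = Model(model_path, N_FEATURES, N_CONTEXT, alphabet, BEAM_WIDTH)
    ds.enableDecoderWithLM(alphabet, lm, trie, LM_WEIGHT, VALID_WORD_COUNT_WEIGHT)
    queue.put(infer_audio_list(list_path, ds))

def infer_in_child_process(model_path, list_path):
    ctx = mp.get_context("spawn")  # spawn: do not inherit any CUDA state from the parent
    queue = ctx.Queue()
    worker = ctx.Process(target=_infer_worker, args=(model_path, list_path, queue))
    worker.start()
    result = queue.get()  # fetch the result before join() to avoid blocking on a full pipe
    worker.join()
    return result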

Thanks for your suggestions.

Thanks for your testing; I’m sorry this is not really in our hands :/.