Data augmentation slowing training

Hi, I am trying to use on-the-fly data augmentation on the audio files.
My first idea was to insert augmentation operations right before extracting features from the audio (i.e., right here https://github.com/mozilla/DeepSpeech/blob/master/util/audio.py#L67).

For this, I've used a Python wrapper for SoX (https://github.com/carlthome/python-audio-effects), and it worked very well. The code looks like this:


import random

import scipy.io.wavfile as wav
from pysndfx import AudioEffectsChain

def audiofile_to_input_vector(audio_filename, numcep, numcontext):
    r"""
    Given a WAV audio file at ``audio_filename``, calculates ``numcep`` MFCC features
    at every 0.01s time step with a window length of 0.025s. Appends ``numcontext``
    context frames to the left and right of each time step, and returns this data
    in a numpy array.
    """
    # Load the wav file
    fs, audio = wav.read(audio_filename)

    # Randomly perturb the tempo (pitch-preserving) for ~90% of the files
    aug_fx = AudioEffectsChain()
    if random.random() < 0.9:
        aug_fx.tempo(random.uniform(0.8, 1.2))
    aug_out = aug_fx(audio)

    return audioToInputVector(aug_out, fs, numcep, numcontext)

The accuracy improved by more than 5%.
Unfortunately, training is roughly 10x slower. I noticed that during training the GPU sometimes sits idle while the CPU is at almost 100%. My guess is that this wrapper is the bottleneck (as it only builds a SoX command that is processed later), and it does not make use of DeepSpeech's parallel threads.
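To sanity-check that guess, the augmentation call can be timed on its own, outside of training (a rough sketch; sample.wav below just stands for any file from my training set):

import time
import random
import scipy.io.wavfile as wav
from pysndfx import AudioEffectsChain

fs, audio = wav.read('sample.wav')   # any WAV from the training set

fx = AudioEffectsChain().tempo(random.uniform(0.8, 1.2))

start = time.time()
for _ in range(100):
    fx(audio)                        # runs the SoX chain once per call
print('seconds per file:', (time.time() - start) / 100)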

Do you have any idea how I can solve this?


I'd guess the bottleneck is indeed the

aug_out = aug_fx(audio)

call, as it's processing the audio each time it grabs a batch. (It is, however, using multiple threads.)

Balancing the GPU and CPU is a bit of an art. We took some time to make sure our GPU wasn't starved waiting for the CPU.

One idea would be to increase the threads per queue[1]

class ModelFeeder(object):
    '''
    Feeds data into a model.
    Feeding is parallelized by independent units called tower feeders (usually one per GPU).
    Each tower feeder provides data from three runtime switchable sources (train, dev, test).
    These sources are to be provided by three DataSet instances whose references are kept.
    Creates, owns and delegates to tower_feeder_count internal tower feeder objects.
    '''
    def __init__(self,
                 train_set,
                 dev_set,
                 test_set,
                 numcep,
                 numcontext,
                 alphabet,
                 tower_feeder_count=-1,
                 threads_per_queue=2):

        self.train = train_set
        self.dev = dev_set
        ...

So the CPU can keep up with the GPU.
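For example, when the feeder is constructed (a sketch only; the actual call site is in DeepSpeech.py, and the argument names follow the constructor above):

model_feeder = ModelFeeder(train_set,
                           dev_set,
                           test_set,
                           numcep,
                           numcontext,
                           alphabet,
                           threads_per_queue=8)  # default is 2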

However, there’s also the question as to how many threads your CPU supports. If that’s already maxed out, increasing threads_per_queue will not help.
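You can check how many logical cores Python sees with:

import multiprocessing
print(multiprocessing.cpu_count())   # logical cores visible to the process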


I've tried threads_per_queue=8, and it improved training speed by roughly 20-25%. Nonetheless, it is still much slower than running without augmentation.

I don't think the augmentations I am using are computationally expensive; it must be something related to how the library works.
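One thing I might try is replacing the SoX call with a purely in-process speed perturbation (a rough sketch; note that this simple resampling also shifts the pitch, unlike SoX's pitch-preserving tempo effect):

import numpy as np

def speed_perturb(audio, rate):
    # Stretch/compress the waveform by linear interpolation.
    # rate > 1 makes the audio faster (shorter), rate < 1 slower (longer).
    old_idx = np.arange(len(audio))
    new_len = int(round(len(audio) / rate))
    new_idx = np.linspace(0, len(audio) - 1, new_len)
    return np.interp(new_idx, old_idx, audio.astype(np.float32)).astype(audio.dtype)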

I'll keep trying to make it faster, but even as slow as it is now, the accuracy improvement is worth it :slight_smile:

Thanks for the help

You are welcome! :grinning: