Data Augmentations

Hi Team,

I had a query about the data augmentations added recently:

Do these augmentations substitute my original spectrograms (i.e. overwrite them in place after the change), or are they appended to the batches to increase the overall amount of training data?

They substitute. You should disable feature caching entirely (remove the .cache() call from the dataset pipeline) when training with augmentation; the data will then be augmented on every epoch, and each epoch will train on freshly augmented data.
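
To illustrate, here is a minimal sketch with dummy data and a toy random-gain augmentation (placeholder names, not our actual pipeline):

```python
import tensorflow as tf

# Dummy stand-ins; the real pipeline loads spectrograms from disk.
spectrograms = tf.random.normal([100, 128, 64])
labels = tf.zeros([100], dtype=tf.int32)

def augment(spec, label):
    # Toy augmentation (random gain), just to show per-epoch randomness.
    return spec * tf.random.uniform([], 0.8, 1.2), label

ds = tf.data.Dataset.from_tensor_slices((spectrograms, labels))
ds = ds.map(augment)   # re-runs on every epoch now that caching is off
# ds = ds.cache()      # removed: this would freeze the first epoch's augmented output
ds = ds.shuffle(100).batch(32)
```

With a .cache() after the map, every epoch would replay the first epoch's augmented tensors instead of drawing new ones.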

Thanks for the reply and the much-needed augmentation work.
How can we test the increase in data?
We removed the .cache() call and re-ran training, but it shows the same number of steps and the same time taken per epoch.

I don’t understand your question. The augmentation is performed in memory; the augmented data isn’t saved anywhere. We still run the same number of steps.
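
If you want to confirm that the augmentation really is re-applied on each pass, here is a quick sanity check. This is only a sketch, assuming eager execution, with augment standing in for the real augmentation:

```python
import tensorflow as tf

def augment(spec):
    return spec * tf.random.uniform([], 0.8, 1.2)  # stand-in augmentation

# One element, no .cache(): every fresh iterator re-runs the map.
ds_check = tf.data.Dataset.from_tensors(tf.ones([128, 64])).map(augment)
a = next(iter(ds_check))  # first pass ("epoch 1")
b = next(iter(ds_check))  # second pass ("epoch 2"): new random draw
print("identical:", bool(tf.reduce_all(a == b).numpy()))  # expect False
```

If this prints True, something upstream is still caching the augmented features.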

So, if each epoch uses new data, I would expect an increase in the time taken per epoch or in the number of steps. Is this assumption incorrect? If so, how do I test that augmentation is actually producing new data each epoch, rather than substituting over the same data?

I hope my noob questions are not too annoying!

The time taken per epoch does not increase because the input pipeline runs concurrently with the training process: while we’re doing the forward/backward pass on the current batch, the next few batches are being processed in the background. The training step itself (including backpropagation) is slower than the input processing, so there’s some wiggle room: you can add more computation to the input pipeline without increasing the time per epoch.
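
In tf.data terms, that overlap comes from running the map in parallel and prefetching batches ahead of the training step. A rough sketch with dummy data and illustrative names, not our exact pipeline:

```python
import tensorflow as tf

spectrograms = tf.random.normal([100, 128, 64])  # dummy data

def augment(spec):
    return spec * tf.random.uniform([], 0.8, 1.2)  # stand-in augmentation

ds = (tf.data.Dataset.from_tensor_slices(spectrograms)
      # Run the augmentation on several CPU threads in parallel:
      .map(augment, num_parallel_calls=tf.data.experimental.AUTOTUNE)
      .batch(32)
      # Prepare upcoming batches in the background while the current
      # forward/backward pass runs on the accelerator:
      .prefetch(buffer_size=tf.data.experimental.AUTOTUNE))
```

As long as preparing a batch takes less time than the training step that consumes it, the augmentation is effectively free in wall-clock terms.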

This TensorFlow guide has a more extensive explanation of how this works: https://www.tensorflow.org/guide/performance/datasets