Data Augmentations

Hi Team,

I had a query about the data augmentations added recently:

Do these augmentations substitute my original spectrograms (i.e. overwrite them in place after the change), or are they appended to the batches to increase the overall amount of training data?

They substitute. You should disable feature caching entirely (remove the .cache() call from the dataset pipeline) when training with augmentation; the data will then be augmented on every epoch, and each epoch will train on freshly augmented data.
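
To illustrate, here is a minimal sketch with dummy data and a toy random-gain augmentation (placeholder names, not our actual pipeline):

```python
import tensorflow as tf

# Dummy stand-ins; the real pipeline loads spectrograms from disk.
spectrograms = tf.random.normal([100, 128, 64])
labels = tf.zeros([100], dtype=tf.int32)

def augment(spec, label):
    # Toy augmentation (random gain), just to show per-epoch randomness.
    return spec * tf.random.uniform([], 0.8, 1.2), label

ds = tf.data.Dataset.from_tensor_slices((spectrograms, labels))
ds = ds.map(augment)   # re-runs on every epoch now that caching is off
# ds = ds.cache()      # removed: this would freeze the first epoch's augmented output
ds = ds.shuffle(100).batch(32)
```

With a .cache() after the map, every epoch would replay the first epoch's augmented tensors instead of drawing new ones.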

Thanks for the reply and the much-needed augmentation work.
How can we test the increase in data?
We removed the .cache() call and re-ran training, but it shows the same number of steps and the same time taken per epoch.

I don’t understand your question. The augmentation is performed in memory; the augmented data isn’t saved anywhere. We still run the same number of steps.
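
If you want to confirm that the augmentation really is re-applied on each pass, here is a quick sanity check. This is only a sketch, assuming eager execution, with augment standing in for the real augmentation:

```python
import tensorflow as tf

def augment(spec):
    return spec * tf.random.uniform([], 0.8, 1.2)  # stand-in augmentation

# One element, no .cache(): every fresh iterator re-runs the map.
ds_check = tf.data.Dataset.from_tensors(tf.ones([128, 64])).map(augment)
a = next(iter(ds_check))  # first pass ("epoch 1")
b = next(iter(ds_check))  # second pass ("epoch 2"): new random draw
print("identical:", bool(tf.reduce_all(a == b).numpy()))  # expect False
```

If this prints True, something upstream is still caching the augmented features.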

So, if each epoch uses new data, I would expect an increase in the time taken per epoch or in the number of steps. Is this assumption incorrect? If so, how do I test that augmentation is actually producing new data each epoch, rather than substituting over the same data?

I hope my noob questions are not too annoying!

The time taken per epoch does not increase because the input pipeline runs concurrently with the training process: while we’re doing the forward/backward pass on the current batch, the next few batches are being processed in the background. The training step itself (including backpropagation) is slower than the input processing, so there’s some wiggle room: you can add more computation to the input pipeline without increasing the time per epoch.
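
In tf.data terms, that overlap comes from running the map in parallel and prefetching batches ahead of the training step. A rough sketch with dummy data and illustrative names, not our exact pipeline:

```python
import tensorflow as tf

spectrograms = tf.random.normal([100, 128, 64])  # dummy data

def augment(spec):
    return spec * tf.random.uniform([], 0.8, 1.2)  # stand-in augmentation

ds = (tf.data.Dataset.from_tensor_slices(spectrograms)
      # Run the augmentation on several CPU threads in parallel:
      .map(augment, num_parallel_calls=tf.data.experimental.AUTOTUNE)
      .batch(32)
      # Prepare upcoming batches in the background while the current
      # forward/backward pass runs on the accelerator:
      .prefetch(buffer_size=tf.data.experimental.AUTOTUNE))
```

As long as preparing a batch takes less time than the training step that consumes it, the augmentation is effectively free in wall-clock terms.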

This TensorFlow guide has a more extensive explanation of how this works: https://www.tensorflow.org/guide/performance/datasets