What you can also do for GPUs with little memory is gradient accumulation (aggregation). It is not implemented in TTS, but it is quite easy to add, and it’d make a good PR as well.
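To be more clear: you run your small batches one at a time for N iterations, computing the gradients for each batch and adding them up without updating the weights. Once you have accumulated N batches, you apply a single weight update and reset the gradients, so the effective batch size is N times the small batch size.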
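
Here is a rough, framework-free sketch of the idea, not how TTS does it (TTS does not implement this). The toy model `y_hat = w * x`, the helper names `grad_mse` and `train_with_accumulation`, and all the parameter values are made up for illustration. In PyTorch the same pattern would be calling `loss.backward()` on every small batch and only calling `optimizer.step()` and `optimizer.zero_grad()` every N batches.

```python
def grad_mse(w, batch):
    """Gradient w.r.t. w of the mean squared error for the toy model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def train_with_accumulation(data, w=0.0, lr=0.001, micro_batch_size=2, accum_steps=4):
    """One pass over `data`, updating w only once every `accum_steps` small batches."""
    accumulated = 0.0
    starts = range(0, len(data), micro_batch_size)
    for i, start in enumerate(starts, start=1):
        batch = data[start:start + micro_batch_size]
        # Divide by accum_steps so the accumulated value is the mean gradient
        # over the large "virtual" batch, not the sum of the small-batch means.
        accumulated += grad_mse(w, batch) / accum_steps
        if i % accum_steps == 0:
            w -= lr * accumulated  # single weight update for N small batches
            accumulated = 0.0      # reset, like optimizer.zero_grad()
    # Assumption of this sketch: leftover batches that do not fill a full
    # group of accum_steps are simply dropped.
    return w


# Toy data for y = 3 * x: 8 samples, i.e. 4 small batches of size 2.
data = [(float(x), 3.0 * x) for x in range(1, 9)]

w_accumulated = train_with_accumulation(data)
w_full_batch = 0.0 - 0.001 * grad_mse(0.0, data)  # one step on all 8 samples at once
```

With equally sized small batches, the accumulated update matches one step on the full batch of 8 (up to floating-point rounding), while only 2 samples need to be processed at a time.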