Training with a 2GB GPU

I am using a GeForce GTX 960.
That’s a 2 GB GPU.
I had out-of-memory problems with it at a batch size of 32. And at 16. And at 8, and even at 2.
Now people here say that with batch sizes lower than 32, it’s unlikely that Tacotron 2 will ever converge.
I tried Colab, but they have started restricting GPU access more; I haven’t been able to use the GPU for more than 6 days.

How could you make the 2 GB setup converge?

Or do I have to use the CPU only?

Is there a way to assign part of what fits into the 2GB to the GPU and do the rest on CPU?

I think 2 GB would be way too little to train with on a GPU - as you’ve seen, the batch size becomes a problem. BTW, I have had some reasonable results with a batch size of 16, so the point about 32 is more advice than a hard limit, but clearly that isn’t going to help you here, unfortunately.

I don’t believe there’s any realistic way to train partly on the GPU and partly on the CPU (that would be down to PyTorch really).
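For completeness: PyTorch does let you place individual modules of one model on different devices and move the activations between them in `forward`, so a partial split is technically possible; in practice the step time is then dominated by the slowest device plus the transfers. A minimal sketch of the idea only (the layer names and sizes below are made up for illustration, this is not Mozilla TTS code):

```python
import torch
import torch.nn as nn

# Hypothetical two-stage model: the first half goes on the GPU if one is
# available, the second half stays on the CPU.
class SplitModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.dev_a = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
        self.dev_b = torch.device('cpu')
        self.encoder = nn.Linear(80, 256).to(self.dev_a)
        self.decoder = nn.Linear(256, 80).to(self.dev_b)

    def forward(self, x):
        h = self.encoder(x.to(self.dev_a))
        # Move activations across the device boundary between the halves.
        return self.decoder(h.to(self.dev_b))

model = SplitModel()
out = model(torch.randn(4, 80))
print(out.shape)  # torch.Size([4, 80])
```

This runs, but each forward/backward pass still waits on the CPU half, which is why it rarely helps on a memory-starved consumer card.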

The problem you’ll run into is that training on the CPU instead will be extremely slow - I’ve never done it properly, but it did happen once in error and I realised because it made almost no progress. This sort of model training simply suits the parallel capabilities of modern GPUs much better.

Out of curiosity I started a Tacotron training run on a 2.3 GHz i7 quad-core and it was awfully slow: 10 or more seconds per step.


This would take 11+ days (at 100% CPU usage) to reach 100k training steps, which is (IMHO) something like a minimum for an acceptable Tacotron 2 model.
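The 11+ days figure follows directly from the observed step time; a quick sanity check of the arithmetic:

```python
# Sanity check of the estimate above: 10 s/step for 100,000 steps.
seconds_per_step = 10
target_steps = 100_000

total_seconds = seconds_per_step * target_steps
total_days = total_seconds / (60 * 60 * 24)

print(round(total_days, 1))  # 11.6
```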

That seems like a task for distributed training 101.

Or get a cloud GPU for a day or two, or a better GPU locally. Set up a re-spawning pre-emptible instance with a GPU, and you can train pretty easily.

I found this sample for a TensorFlow-based TTS that delegates to the CPU and GPU simultaneously.

How would I need to change the PyTorch setup of Mozilla TTS for this to work:

import tensorflow as tf
import numpy as np
from time import time
from threading import Thread

n = 1024 * 2

# A small matrix for the CPU and a large one for the GPU.
data_cpu = np.random.uniform(size=[n // 16, n]).astype(np.float32)
data_gpu = np.random.uniform(size=[n, n]).astype(np.float32)

with tf.device('/cpu:0'):
    x = tf.placeholder(name='x', dtype=tf.float32)

def get_var(name):
    return tf.get_variable(name, shape=[n, n])

def op(name):
    # Chain eight matmuls against a per-device weight matrix.
    w = get_var(name)
    y = x
    for _ in range(8):
        y = tf.matmul(y, w)
    return y

with tf.device('/cpu:0'):
    cpu = op('w_cpu')

with tf.device('/gpu:0'):
    gpu = op('w_gpu')

def f(session, y, data):
    return session.run(y, feed_dict={x: data})

with tf.Session(config=tf.ConfigProto(log_device_placement=True,
                                      intra_op_parallelism_threads=8)) as sess:
    sess.run(tf.global_variables_initializer())

    threads = []
    # Comment out 0 or 1 of the following 2 lines:
    threads += [Thread(target=f, args=(sess, cpu, data_cpu))]
    threads += [Thread(target=f, args=(sess, gpu, data_gpu))]

    t0 = time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    t1 = time()

    print(t1 - t0)

Also, how does this affect the maximum sentence length and the size of the dataset?

Will this fill the 2 GB and use the CPU and RAM for the rest, or will it just split the work evenly, as if I had two 2 GB GPUs?

The code looks more like a benchmark script to me.

Then make it work. You are like the guy who comes to an island, does not see a hotel with a shower and wifi there and leaves instead of thinking: “What could I do to build a hotel with a shower and wifi here?”.

Why would he have any interest in fixing your problem for free?

You’ve also mistaken him in your analogy for yourself.

Because it is not my problem, but everybody’s problem. If everybody could use the CPU and GPU plus their memory together, we might not be dependent on Colab and could accomplish much more in a shorter time.

We could use a CUDA GPU + its memory + the CPU + system memory + the swap file + the motherboard’s integrated GPU(!) + its memory.

At. The. Same. Time.

Right now we use either the GPU or, if that doesn’t work, the much slower CPU.

A lot of processing power is not being used because of that.

My analogy cannot be applied to me, because I am looking for solutions that are not here; he is just destroying the immature idea that is there without offering anything himself.

Besides, I offered the source code and asked important questions to solve this - what are you offering to solve the problem?

Here’s your actual solution: get access to a better (8 GB+) GPU and save everyone a huge amount of trouble. You’re on the internet, you have access, now you just need to choose to make use of it. Trying to hamstring yourself even more, like you’re doing above, will definitely serve my amusement and your lack of progress, so if that’s what you’re after, keep it up.


You are being unreasonable here. Leave this thread, together with dkreutz, so as not to deter people who are willing to build something.

You asked for a solution. You got one.

You choosing not to make progress isn’t dkreutz’s or my issue. Your perception of what’s going on here seems to be rather skewed, in fact. People keep trying to help you, and you keep being belligerent when you’re not getting an answer you want. It’s certainly a good way to not get help, not make progress, and alienate anyone who could do something for you. I will certainly enjoy and continue to comment on your floundering until you choose to help yourself out, of course. :slight_smile:

You stultify yourself, baconator. Don’t make it worse for yourself. Leave the thread.

Do let us know when you start to try fixing your own problems. :smiley:

As I count myself among “everybody”, I honestly have to disagree. I don’t have the problem, so “everybody” is proven wrong.

But as @baconator already said: please consider investing in a better GPU. In my opinion it’s a much better and more reliable option than pooling all the GPU + CPU + memory + swap file + floppy disk + … resources together.


I am not convinced that this will be worthwhile for training scenarios, as the CPU contribution would be minimal and it would likely add substantial complexity.


Floppy disk?

Summary of both + dkreutz:
I have no idea how or whether that works, but I think it will not perform well, based on my lack of experience with it, and therefore I need to inform everybody here that I feel that way. And I will libel every approach that tries to come closer to it, based on that prejudice.

Which makes me wonder: what if the creators of tacotron had thought that way?

Presumably everyone has a lack of experience with things that have never been done before!

I’m not saying no one should try it; I’m merely of the opinion that it’s not worth it - we can already determine that the performance contribution to training would be minimal from what we observe when training solely on the CPU.

I couldn’t definitively comment on the complexity but it seems a fair assumption that it would add to it.

If you find evidence to suggest it will make a larger contribution to performance and you find evidence that it can be implemented simply then I suggest you make a case that it should be looked at. So far I’m not seeing anything that stands up to scrutiny.
