What are the options for someone without a proper GPU? Cloud services, VMs or external GPUs?

Hello everyone,
I want to train my first model with 35 hours of training data on a laptop with 16 GB RAM, a 2.6 GHz 6-core Intel Core i7 processor, and no proper GPU (Intel HD graphics only).

I don’t have much experience with TensorFlow, so some concepts like epochs are still a little vague to me. Maybe I can solve my problems by passing more arguments to the script.

I let DeepSpeech.py run overnight with only train_files, dev_files, test_files, and export_dir as arguments. After 14 hours it still hasn’t finished the first epoch.
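For anyone reading along, the run described above corresponds to something like the following (a sketch: the CSV paths are placeholders, and the flag names are those of the DeepSpeech training script of that era):

```shell
# Minimal DeepSpeech training invocation (placeholder CSV paths).
python3 DeepSpeech.py \
  --train_files data/train.csv \
  --dev_files data/dev.csv \
  --test_files data/test.csv \
  --export_dir exported_model/
```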

Is this normal? How can I optimize this? A few thoughts I have:

  • I am thinking about renting a cloud service for training; does anybody have tips for that? Which one works well and fits the budget of a mere hobby?
  • Another way would be to run the training inside a VM, so I can pause the process and let it run in intervals over weeks at 70% of my processor during normal usage. (I have already done this with other long-running projects.)
  • Maybe an external GPU could be a good investment. Does anyone have experience with that?
  • Or I could just let it run on an old laptop in a corner for weeks; it has a 2015 i5 processor with 14 GB RAM.

Which option would you prefer? Or have I overlooked another way to solve this?

Given your setup, yes.

As you said, GPUs :slight_smile:

Several people are doing that, but we don’t have experience with it ourselves (we have our own hardware), so I’ll let others share their feedback here.

That adds a layer of indirection and is likely going to hurt your performance (I/O especially, if you don’t set up the VM properly). TensorFlow has checkpoint support, so if you interrupt training and it has made enough progress, it will be able to restart from there.
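As a sketch of how that works in practice (paths are placeholders; --checkpoint_dir and --checkpoint_secs are flags of the DeepSpeech training script): re-running the same command with the same --checkpoint_dir picks up from the last saved checkpoint.

```shell
# Keep checkpoints in a fixed directory so an interrupted run
# can resume from the last saved state.
python3 DeepSpeech.py \
  --train_files data/train.csv \
  --dev_files data/dev.csv \
  --test_files data/test.csv \
  --checkpoint_dir checkpoints/ \
  --checkpoint_secs 600  # write a checkpoint roughly every 10 minutes
```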

The connection to your PCIe subsystem is going to be the main issue here. I don’t know how CUDA behaves with an external GPU, and we’ve had reports (though we could never verify them for sure) that a slow PCIe link or an extra PCIe adapter slows down training a lot.

Training on CPU, even with your amount of data, is going to take more than weeks.

You have not talked about money here, and that’s not a small question. How much can you afford?

Also, it’s important to get more context: what are you working on, and why are you training this model? Can you join efforts with other people if you are working on a community-level language?


Hey lissyx, thanks for your answer.

This sounds good, I will look into it. Not a single file was written to the folder during the training process, so I think I hadn’t made enough progress yet. Do you know a way to save checkpoints earlier?

I am willing to pay 20-50 € per month for a cloud service. Maybe 300€ for a GPU. I don’t own a desktop PC.

First, to learn about the technology, because I am curious. Second, to motivate others to donate more to the dataset(s). Right now I am experimenting with the Esperanto dataset. This language looks especially interesting for neural networks since it is completely regular and has a clear one-sound-per-letter alphabet. I am curious whether this also means better results with neural networks compared to natural languages. I would also love to experiment with the German dataset, but that is out of my technical league right now.

For the future, I would love to have a system that is able to transcribe Esperanto podcasts from people who have already donated to Common Voice. I think I won’t get further than a system that can transcribe the voices of donors. As I understand it, a general-purpose system needs tens of thousands of different voices.

Good idea, I will contact others who have worked on the project. I’ve already talked to the Esperantic Studies Foundation about collecting public-domain sentences; maybe they are also willing to support this project. I will also contact some other organizations. I just thought it might be good to have some results before contacting others, so I can say more about questions like how realistic a working system is in the coming years and how much work it will be.

Edit: maybe I’ll just start with my own voice donations and use the full dataset once I know more about the system. This could even be done with German; I “just” have to find my ID in the dataset.

FWIW, since you’re mentioning external GPUs specifically, I haven’t seen any reports on setups like that, so you might be breaking new ground. I can imagine the limited bandwidth being a problem.


50€ might be a bit low.

How big is this one ?

eGPUs are tricky; this might still be a bit low of a budget, but it could be much more efficient / flexible. What’s your laptop?


An eGPU enclosure + power supply by itself is already ~300€. While training with 35 h is not going to require the most powerful GPU, that kind of solution is mostly designed for high-end GPUs, so it’s expensive :confused:

Thanks for the info. I have access to a MacBook Pro 2018 and a ThinkPad from 2015, so I could connect to an eGPU via USB-C. But I believe a cloud solution works best for me; I will keep you updated on that.

Just to put things into perspective: how long would training with 35 h take on a modern GPU? Would it be days or still weeks?

For now I will experiment with smaller datasets of my own voice to learn more and have something to show to others. Most likely I will invest money in a full training after the next release of the dataset in August or so.

Are you running Linux or macOS? There’s no CUDA support in TensorFlow on macOS, so you won’t be able to.

That depends on your definition of “modern GPU” as well as your training parameters, but with 35 h and 20 epochs (likely too many), it’s very likely to be < 1 h.

With 2x RTX 2080 Ti, I’m training 15 epochs on 650 h in ~7 h, more or less 18 min per epoch.
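A quick back-of-envelope check of that estimate, assuming epoch time scales roughly linearly with dataset size (which ignores batch-size and I/O effects):

```python
# Scale the reported throughput (~18 min per epoch on 650 h of audio,
# 2x RTX 2080 Ti) down to a 35 h dataset. Rough linear assumption only.

reported_hours = 650       # hours of audio in the reference run
reported_epoch_min = 18    # minutes per epoch reported above
dataset_hours = 35         # the Esperanto dataset size
epochs = 20                # the (likely too high) epoch count discussed

minutes_per_epoch = reported_epoch_min * dataset_hours / reported_hours
total_minutes = minutes_per_epoch * epochs

print(f"{minutes_per_epoch:.1f} min/epoch, {total_minutes:.0f} min total")
# Under these assumptions the 20-epoch run stays well under an hour,
# consistent with the "< 1 h" estimate above.
```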

Good to know. I will think about Boot Camp, but this just looks like another argument for a cloud solution or a desktop PC.

This sounds promising, maybe it’s time to invest in a gaming PC :slight_smile:

It’s likely this would be more efficient, but it’s also more expensive. I guess you need to check how fast a cloud-based setup can run, and compare the costs of both.


There are additional arguments you can use to speed things up, such as --use_cudnn_rnn (requires a GPU) and --automatic_mixed_precision=True.

Using the CPU alone might take too long; even a small GPU will speed up your training a lot. I have read about people using Google Colab TPUs to train, but that may be a bit tricky.

If you have some money available, you should get a Google, MS, or AWS VM with just one GPU. With 35 h of input, a single V100 should run an epoch in ca. 10 minutes. As it costs about 2.50 USD an hour, that could be a feasible option.
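A rough cost estimate based on those numbers (the 20-epoch count is an assumption carried over from earlier in the thread):

```python
# Ballpark cloud cost: single V100 at ~2.50 USD/h,
# ~10 min per epoch on 35 h of audio, 20 epochs assumed.

usd_per_hour = 2.50
minutes_per_epoch = 10
epochs = 20

total_hours = minutes_per_epoch * epochs / 60
total_cost = total_hours * usd_per_hour
print(f"~{total_hours:.1f} h of GPU time, ~{total_cost:.2f} USD")
# A full run would land in the single-digit USD range, well inside
# the 20-50 € / month hobby budget mentioned above.
```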

Try @agarwalaashish20 's repo to get started. Definitely increase batch sizes (2, 4, 8); that might even speed up CPU training for you now. And use TF 1.15 with 0.6.0; that worked well for me.
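Putting those suggestions together, a training command might look roughly like this (a sketch, not a verified configuration; batch sizes have to fit your GPU memory, and paths are placeholders):

```shell
# Sketch of a GPU run combining the speed-related flags mentioned above.
python3 DeepSpeech.py \
  --train_files data/train.csv \
  --dev_files data/dev.csv \
  --test_files data/test.csv \
  --train_batch_size 8 \
  --dev_batch_size 8 \
  --test_batch_size 8 \
  --use_cudnn_rnn \
  --automatic_mixed_precision=True \
  --export_dir exported_model/
```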

Beware: 35 hours is not much for a general language model; it might work for specific use cases.


I will try that. Are you talking about this repo: https://github.com/AASHISHAG/deepspeech-german ?

I am aware of that; I mainly want to do this to learn and to get a feeling for the results. At which point can I expect better results? Considering both the sentence-collection speed and the response from the community, the best I can expect by August is 150 h from 1,000 people. Would that be enough for a first alpha of a general-purpose voice-recognition system usable by people who donated to the dataset, or is 1,000 hours really the minimum?

Yep, deepspeech-german is a good start :slight_smile:

As for the amount necessary, it depends on how good the results should be. I would argue that you’d get a decent model for the Common Voice input. But I guess someone will build a German model this year with a lot more data that you could then refine (Transformer) to your needs.


For reference, with mixed sources for French I managed to get around 650 h, and this produces good enough results for demoing and getting people on board.


I finally found a service that is usable without a credit card and affordable for me. I now use a machine with two 1080 Tis in the cloud on

They are a little more expensive than others but I won’t use them for weeks, so this is okay for me.

If you just want a German model, why not try this one first? It’s on the 0.6 code base, but fresh :slight_smile:

I want to create a model for Esperanto (plus I want to learn the technology).

But I will also try out the German model, of course. How good is it already?

Ah, yes, I remember. Quality is really good for slow, clear speech; it is worse with accents and/or fast speech.


Nice, I can’t wait until Firefox Voice and Mycroft support DeepSpeech. I assume the quality is better for people who actually donated to the dataset, right?

I know that the 35 h (and soon 70 h) of Esperanto won’t bring good results. I see this as a learning project, and if I can create a network that is somewhat functional, it can also be used to motivate people to donate more audio.