Running TTS on constrained hardware (+ no GPU)

Has anyone looked at the practicalities of running this TTS inference on constrained hardware, such as a mobile phone / Raspberry Pi?

I haven’t got to the point of trying this myself yet but it would be useful to hear if anyone tried it and/or if it’s on the road map for the project.

I’m assuming the inference time would be measurably longer, if it’s possible at all - of course, maybe not having a GPU would be a deal breaker (??)

If it weren’t exceptionally slow it might still be reasonably usable for a number of scenarios, as it’s fairly easy to make the demo server cache results (helpful where the bulk of your responses come from a common set of spoken outputs, which wouldn’t need inference after the first time).
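The caching idea above can be sketched in a few lines. This is a minimal illustration, not the demo server’s actual code; `synthesize` is a hypothetical stand-in for whatever function renders text to audio bytes on your setup.

```python
import hashlib

# Cache of previously synthesised utterances, keyed by a hash of the text.
_cache = {}

def cached_tts(text, synthesize):
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = synthesize(text)  # slow path: run inference once
    return _cache[key]                  # fast path: reuse stored audio
```

Repeated requests for the same utterance then cost only a dictionary lookup, which is exactly the win you’d want on constrained hardware.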


I don’t think you can go down as far as a RasPi, but if you don’t use a neural vocoder with the Tacotron architecture, TTS is able to reach real-time on CPU.

Nevertheless, our ultimate goal is to optimize all the code to be able to run on low resource systems. Any contribution ahead is also always welcome.


Inference from text to mel spectrograms runs fine on an older 4-core Intel CPU. It’s about 10x faster than real time. There are many tricks that can be used to speed it up and reduce memory use further - pruning weights, quantization to int16 or even int8. This should make it fast enough to run on a high-end phone.
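To make the quantization trick concrete, here is a minimal NumPy sketch of symmetric per-tensor int8 quantization - just an illustration of the idea, not the quantization scheme any particular TTS release uses. Storage drops 4x (float32 → int8) at the cost of a small, bounded rounding error per weight.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization of a weight array."""
    peak = np.abs(w).max()
    scale = peak / 127.0 if peak > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float32 tensor from the int8 codes."""
    return q.astype(np.float32) * scale
```

The worst-case reconstruction error is half the scale step, which for well-behaved weight distributions is usually inaudible after synthesis.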

WaveRNN type vocoders are fast enough for real time synthesis on a laptop CPU and may be fast enough on a mobile with some additional optimization.

However, while it may run fast enough it may drain the battery too fast for some applications (e.g. audiobooks).

Hi!

We’ve managed to create a C++ implementation of the Tacotron multi-speaker embedding model based on OpenCV, which runs near real-time or faster on contemporary mobile devices. The training is done in Python using the original MozillaTTS implementation, and then the data is converted into a layout that is easier to read from C++.
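A conversion step like the one described might look something like the sketch below - this is purely hypothetical (the poster hasn’t shared their format), but it shows the general idea of dumping named float32 tensors into a flat, self-describing binary layout that a C++ loader can read sequentially or mmap.

```python
import struct
import numpy as np

def export_weights(tensors, path):
    """Write named float32 tensors in a simple C++-friendly layout:
    [count] then, per tensor: [name_len][name][ndim][dims...][raw data].
    All integers are little-endian uint32; data is contiguous float32."""
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(tensors)))
        for name, arr in tensors.items():
            arr = np.ascontiguousarray(arr, dtype=np.float32)
            nb = name.encode("utf-8")
            f.write(struct.pack("<I", len(nb)))
            f.write(nb)
            f.write(struct.pack("<I", arr.ndim))
            f.write(struct.pack(f"<{arr.ndim}I", *arr.shape))
            f.write(arr.tobytes())
```

On the C++ side the matching reader is a handful of `fread` calls with no dependency on any Python serialisation format.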


Can you point to any examples of executable code that does this? I’m very interested in running TTS on constrained systems.

Quick update: I’ve managed to get TTS up and running on a Raspberry Pi 4 as I was curious to see if it was possible. I’m planning to write up the details soon.

In essence it’s slow (as expected) but could just about be feasible for batched speech generation (i.e. not live response scenarios, but ones where you’re generating output from a chunk of text overnight, such as a report you’d want to listen to in the morning).
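The overnight batch scenario is straightforward to script. Below is a minimal sketch using only the standard library’s `wave` module; `synthesize` is a placeholder for the actual model call and is assumed to return mono float audio in [-1, 1].

```python
import wave
import numpy as np

def batch_generate(sentences, synthesize, sample_rate=22050):
    """Render each sentence to its own 16-bit PCM WAV file,
    e.g. as an unattended overnight job."""
    paths = []
    for i, text in enumerate(sentences):
        audio = np.asarray(synthesize(text), dtype=np.float32)
        pcm = (np.clip(audio, -1.0, 1.0) * 32767).astype(np.int16)
        path = f"out_{i:04d}.wav"
        with wave.open(path, "wb") as w:
            w.setnchannels(1)          # mono
            w.setsampwidth(2)          # 16-bit samples
            w.setframerate(sample_rate)
            w.writeframes(pcm.tobytes())
        paths.append(path)
    return paths
```

Because each sentence is independent, a run that takes hours can also be interrupted and resumed by skipping files that already exist.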

It uses the released Tacotron2 model that’s part of the TTS model wheel, but without the PWGAN part (I haven’t looked at that yet), so it falls back to Griffin-Lim. The quality seems good, although a couple of sounds occasionally seem a little off - as if there’s something up with the phonemes being passed, or with using the model on the current branch (I ended up installing from master, which may differ from the branch the model was packaged against).
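For anyone unfamiliar with Griffin-Lim: it recovers a waveform from a magnitude spectrogram by iteratively re-estimating the missing phase. Here’s a compact SciPy-based sketch of the algorithm itself - illustrative only, as the released model applies it with its own STFT parameters and spectrogram scaling.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=30, nperseg=512):
    """Reconstruct audio from a magnitude spectrogram `mag`
    (shape: freq bins x frames) via Griffin-Lim phase estimation."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    for _ in range(n_iter):
        _, x = istft(mag * phase, nperseg=nperseg)      # back to time domain
        _, _, S = stft(x, nperseg=nperseg)              # re-analyse
        # keep the frame count aligned with the target magnitude
        if S.shape[1] < mag.shape[1]:
            S = np.pad(S, ((0, 0), (0, mag.shape[1] - S.shape[1])))
        else:
            S = S[:, :mag.shape[1]]
        phase = np.exp(1j * np.angle(S))                # keep phase, discard magnitude
    _, x = istft(mag * phase, nperseg=nperseg)
    return x
```

Each iteration is just an STFT/ISTFT pair, which is why Griffin-Lim is cheap enough for a Pi while neural vocoders are not - the trade-off being its characteristic slightly metallic output.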

I’m producing some timings now, but so far it varies between 0.6 s and 1.5 s per character (which was easier to measure than the duration of the output audio; back-of-an-envelope calculations suggest one hour of audio would take around five to six hours to produce).
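Measuring in seconds per character is easy to reproduce. A minimal harness might look like this, where `tts_fn` is a hypothetical stand-in for the actual synthesis entry point:

```python
import time

def seconds_per_char(tts_fn, text):
    """Time one synthesis call and normalise by input length."""
    t0 = time.perf_counter()
    tts_fn(text)
    elapsed = time.perf_counter() - t0
    return elapsed / max(len(text), 1)  # guard against empty input
```

Multiplying the result by the character count of your source text gives a rough ETA for a batch run.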


Hello Neil, I’m very interested in running TTS on my RPi 4. Did you write up any instructions or highlights I could go through?
Many thanks.

Hi @carduque - I definitely plan to, and I took detailed notes as I went. Over the next evening or two I just need to confirm whether a couple of steps were strictly necessary (as I didn’t get it all 100% right first time) - once I’ve confirmed, I’ll post the details.

Also, to manage expectations: it is pretty slow, but like I said there could be reasonable batch-style applications.


Can’t wait for the write-up, as I’m doing pretty much the same, except my CPU isn’t as constrained as a Pi 4’s.

Soon I’ll release a TensorFlow model with a script to convert our latest Torch model, so maybe you can consider using TF-Lite for better performance on the RasPi.

Yes, that sounds great. TF-lite allows really impressive performance for DeepSpeech on RPi

I’ve fallen a little behind with the write-up, but it’s mainly showing how to get around a couple of non-obvious installation issues with a couple of packages. It’s a national holiday here tomorrow, so a long weekend, meaning it’ll definitely get done by Sunday :slightly_smiling_face:

Sorry for the bluntness, but soon as in when?

Awesome, thank you. Can’t wait

@nmstoker Any updates? I ran into the same problem. I’d like to do TTS on a Raspberry Pi using TF Lite or OpenCV, reading text out loud as it’s detected.

I was/am considering:

Festival, Flite: A small fast portable speech synthesis system

SqueezeWave: Extremely Lightweight Vocoders for On-device Speech Synthesis (A lighter version of WaveGlowish)

WaveNet
A TensorFlow implementation of DeepMind’s WaveNet paper

Tacotron
An implementation of Tacotron speech synthesis in TensorFlow.

WaveGlow
A Flow-based Generative Network for Speech Synthesis

Tacotron 2? Are you using Google’s or NVIDIA’s?

I was trying to prioritise making it light/fast over customisation or handwriting recognition.
If anyone has experience please let me know thoughts/pros/cons

Of those, Flite might work on a Pi for something near real-time. The rest would benefit greatly from having more resources.


Got it! Do you know if Flite would integrate with TF Lite vs full TF? I’m also using a Coral (Edge TPU).

Flite does not use TensorFlow.

As in “don’t make any plans that depend on when I release”.

So can I say TTS is 6x slower than real time on a RasPi?


Yes, that’s a fair estimate :slightly_smiling_face: