Running TTS on constrained hardware (+ no GPU)

If that's the case, I don't think any other model, even FastSpeech, would run in real time on a Raspberry Pi, at least not without additional optimization. FastSpeech is very computation-heavy even though it is structurally feed-forward.

I’ve put detailed instructions on how to install it on an RPi4 here: https://medium.com/@nmstoker/installing-mozilla-tts-on-a-raspberry-pi-4-e6af16459ab9

I actually recorded the install end-to-end with asciinema, intending to post that too, but I ran into difficulties: the file is over their size limit, and converting it to a video (which itself is somewhat against the intent of asciinema) also caused problems. Rather than hold this up for that, I've posted it now.

There could be some refinements to the approach, but I know for sure that this works with the Feb 2020 Buster release on an RPi4 4GB (having done a complete run-through to confirm my cut-down version was okay, and then again to record the terminal session!!)

Would be interested to hear how people get on if they give it a go :slightly_smiling_face:


Amazing work, thanks a lot @nmstoker :smiley:.


I added this to the project wiki under Examples.

Hey @_CA_A, have you tried SqueezeWave? They promise a lot of speedup, and they have something to try on GitHub. However, I couldn't install the requirements; I've tried with Python 3.6, Python 3.7 and Python 3.8.

I think that they might be the best hope for an ARM TTS right now…

I don't have a one-to-one comparison, but MelGAN can also run on a Raspberry Pi and is probably easier to train.

Thanks to your great documentation I've set up the TTS server on a Raspberry Pi 3 Model B Rev 1.2.

@nmstoker What about some test sentences for a performance comparison? (A minimal timing sketch follows the list.)

  • Hello, how are you?
  • This phrase could be spoken because of mozilla tts project and it’s great community.
  • free to use text to speech and speech to text by common voice is important for future
  • Hello Neil, how about a little raspberry performance battle?
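
A minimal timing sketch for these, assuming the demo server from the article above is running locally (the `/api/tts` endpoint, `text` parameter and default port 5002 match the demo server at the time of writing; adjust to your setup):

```python
# Rough timing sketch: request each test sentence from a locally
# running Mozilla TTS demo server (server.py) and report wall-clock
# synthesis time. Assumes the demo server's default port 5002.
import time
import requests

SENTENCES = [
    "Hello, how are you?",
    "This phrase could be spoken because of mozilla tts project and it's great community.",
    "free to use text to speech and speech to text by common voice is important for future",
    "Hello Neil, how about a little raspberry performance battle?",
]

for sentence in SENTENCES:
    start = time.time()
    resp = requests.get("http://localhost:5002/api/tts",
                        params={"text": sentence}, timeout=600)
    resp.raise_for_status()  # fail loudly if the server errors out
    print(f"{time.time() - start:6.1f} s  ->  {sentence}")
```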

Raspberry Pi 3 Model B Rev 1.2:

  • Hello, how are you? → 37 seconds
  • This phrase could be spoken because of mozilla tts project and it's great community. → 100 seconds
  • free to use text to speech and speech to text by common voice is important for future → 100 seconds
  • Hello Neil, how about a little raspberry performance battle? → 84 seconds

Very good!

I expect the TF version will give quite a speed boost on the RPi.

Also, this reminds me: I need to follow up to see what needs to be done to get llvmlite compiled into a wheel on piwheels (for RPi), which in turn should smooth the installation of librosa. Details here: https://github.com/piwheels/packages/issues/33

What model did you use for this?

I used the model described in the article by @nmstoker.

Has anyone tried the Facebook CPU-based TTS? https://github.com/facebookarchive/loop

Training still requires CUDA, but inference runs on CPU. I just wondered what CPU(s) people have run it on.

Sorry, I haven't tried it, and from a quick skim of the repo I couldn't see any detail about CPU compatibility.

If others know, they can chip in, but this seems like it could be verging on off-topic (discussion of other repos does happen here besides the main TTS one, but that tends to be for comparison/background knowledge or regarding use with/integration into TTS here).

Whether you get a response here or not, bear in mind that that repo is archived, so my guess is the original authors have moved on to new things and any support for the repo may be limited. Best of luck all the same :slightly_smiling_face:


What you seek is seeking you: Check their paper.

This made me think about creating my own SaaS for myself, using my server to do the computation.
As in: I send a request containing the phrase and get the result back as a WAV file.
Then the power of the device (in my case an RPi 1) does not matter, just the internet connection.
Has anyone done that yet, so I don't reinvent the wheel?

There are lots of ways to do this kind of thing if you want your personal SaaS available to you anywhere, but it's trivial with something like ngrok: you run server.py and then point ngrok at it.

Bear in mind the server's more of a demonstration server than a production-ready setup, although it is workable. You'd want to take care to secure things whatever method you use to access the server.
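
For completeness, the client side is tiny; a sketch (the URL below is a placeholder for your ngrok tunnel or your own server's address, and the `/api/tts` endpoint follows the demo server's API at the time of writing):

```python
# Minimal client sketch: request synthesis from a remote instance of
# the demo server and save the response body as a WAV file.
import requests

SERVER_URL = "https://example.ngrok.io/api/tts"  # placeholder address

def synthesize(text, out_path="output.wav"):
    resp = requests.get(SERVER_URL, params={"text": text}, timeout=120)
    resp.raise_for_status()  # surface server-side errors
    with open(out_path, "wb") as f:
        f.write(resp.content)  # the response body is the WAV audio

synthesize("Hello, how are you?")
```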

It seems that I don't need ngrok if I already have a PHP server running, right?

ngrok is just to make the service publicly available. If you've got a server on the internet, then you're fine to use that instead. And of course, if you just want to access it all locally, you don't need it either.

I would connect to my server over the internet from my Pi here. I'm already doing that with SSH keys to fetch log files regularly over an SSH connection.


Our TTS models can run decently on one CPU thread / core.

Please see our TTS models here - https://github.com/snakers4/silero-models#text-to-speech (corresponding article https://habr.com/ru/post/549482/)
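
For anyone who wants to try on CPU, loading via torch.hub looked roughly like this per the repo README at the time (speaker names and the exact call signature may change, so treat this as a sketch and check the repo for the current API):

```python
# Sketch of CPU-only synthesis with a Silero model via torch.hub,
# following the snakers4/silero-models README at the time of writing.
import torch

device = torch.device('cpu')
# torch.hub downloads the model and returns the synthesis helpers.
model, symbols, sample_rate, example_text, apply_tts = torch.hub.load(
    repo_or_dir='snakers4/silero-models',
    model='silero_tts',
    language='en',
    speaker='lj_16khz')  # speaker ID may differ in newer releases
model = model.to(device)

# Synthesize a batch of one sentence; returns audio tensors.
audio = apply_tts(texts=[example_text],
                  model=model,
                  sample_rate=sample_rate,
                  symbols=symbols,
                  device=device)
```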

Just let me repost some of the benchmarks here:

  • RTF (Real Time Factor): the time the synthesis takes divided by the audio duration;

  • RTS = 1 / RTF (Real Time Speed): how much "faster" than real time the synthesis is (worked example after the list);
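
To make the two metrics concrete, a quick arithmetic check (pure illustration, not from the benchmark itself):

```python
# Worked example: if 10 s of audio takes 7 s to synthesize, then
# RTF = 0.7 and RTS ~= 1.4, i.e. 1.4x faster than real time -- the
# 16 kHz single-thread CPU case in the table below.
synthesis_time = 7.0    # seconds of compute
audio_duration = 10.0   # seconds of generated speech

rtf = synthesis_time / audio_duration   # 0.7
rts = 1 / rtf                           # ~1.43
print(f"RTF = {rtf:.2f}, RTS = {rts:.1f}")
```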

We benchmarked the models on two devices using PyTorch 1.8's benchmark utilities:

  • CPU - Intel i7-6800K CPU @ 3.40GHz;

  • GPU - 1080 Ti;

  • When measuring CPU performance, we also limited the number of threads used (see the snippet after this list);
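
The thread limiting itself is just PyTorch's standard knob (a sketch; the full benchmark harness is more involved):

```python
# Pinning CPU inference to a fixed thread count, as in the CPU rows
# of the tables below. torch.set_num_threads controls intra-op
# parallelism for CPU ops.
import torch

torch.set_num_threads(1)  # e.g. the "CPU 1 thread" rows
```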

For the 16 kHz models we got the following metrics:

| BatchSize | Device        | RTF   | RTS   |
| --------- | ------------- | ----- | ----- |
| 1         | CPU 1 thread  | 0.7   | 1.4   |
| 1         | CPU 2 threads | 0.4   | 2.3   |
| 1         | CPU 4 threads | 0.3   | 3.1   |
| 4         | CPU 1 thread  | 0.5   | 2.0   |
| 4         | CPU 2 threads | 0.3   | 3.2   |
| 4         | CPU 4 threads | 0.2   | 4.9   |
| 1         | GPU           | 0.06  | 16.9  |
| 4         | GPU           | 0.02  | 51.7  |
| 8         | GPU           | 0.01  | 79.4  |
| 16        | GPU           | 0.008 | 122.9 |
| 32        | GPU           | 0.006 | 161.2 |

For the 8 kHz models we got the following metrics:

| BatchSize | Device        | RTF   | RTS   |
| --------- | ------------- | ----- | ----- |
| 1         | CPU 1 thread  | 0.5   | 1.9   |
| 1         | CPU 2 threads | 0.3   | 3.0   |
| 1         | CPU 4 threads | 0.2   | 4.2   |
| 4         | CPU 1 thread  | 0.4   | 2.8   |
| 4         | CPU 2 threads | 0.2   | 4.4   |
| 4         | CPU 4 threads | 0.1   | 6.6   |
| 1         | GPU           | 0.06  | 17.5  |
| 4         | GPU           | 0.02  | 55.0  |
| 8         | GPU           | 0.01  | 92.1  |
| 16        | GPU           | 0.007 | 147.7 |
| 32        | GPU           | 0.004 | 227.5 |