Sample Colab Notebook for faster than real-time speech synthesis on a CPU

https://colab.research.google.com/drive/1u_16ZzHjKYFn1HNVuA4Qf_i2MMFB9olY?usp=sharing

Hi erogol,

Florian here from SEPIA Open Assistant. I thought it'd be better to continue our Twitter discussion here :wink: .

I’ve been playing around a bit more with the Raspberry Pi 4 adaptation of your Colab code. Especially with the threading.

The Pi 4 has 4 cores. First I tried to set the number of threads in code (torch.set_num_threads and torch.set_num_interop_threads). This does not seem to work: the Pi was always at 400% CPU usage no matter where I put the calls.
I searched the web and found other users with the same problem. What worked for me was launching with OMP_NUM_THREADS=1 python3. After that I saw CPU usage at 100%/200%/etc., depending on the OMP_NUM_THREADS value.
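For reference, the ordering seems to be the key point: OMP_NUM_THREADS only takes effect if it is set before the OpenMP runtime initializes, i.e. before torch is imported. A minimal sketch (without actually importing torch, just showing where the setting has to go):

```python
import os

# OMP_NUM_THREADS must be set before the OpenMP runtime initializes,
# i.e. before `import torch`. Launching with
# `OMP_NUM_THREADS=1 python3 script.py` guarantees this; setting it
# at the very top of the script, before any torch import, works for
# the same reason.
os.environ["OMP_NUM_THREADS"] = "1"

import multiprocessing
print("cores available:", multiprocessing.cpu_count())
print("OMP threads requested:", os.environ["OMP_NUM_THREADS"])
```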

Now for the results. I ran each test 3 times and took the middle (median) result; errors are given as a very rough estimate.

Test sentence: "Hello this is a test"

1 Core:

Step 1: 3.37 (+/- 0.1)
Run-time: 5.54
Real-time factor: 3.702
Time per step: 0.0001678

2 Cores:

Step 1: 2.92 (+/- 0.1)
Run-time: 4.41
Real-time factor: 2.945
Time per step: 0.0001335

3 Cores:

Step 1: 2.86 (+/- 0.15)
Run-time: 4.24
Real-time factor: 2.83
Time per step: 0.0001284

4 Cores:

Step 1: 2.86 (+/- 0.2)
Run-time: 4.16
Real-time factor: 2.77
Time per step: 0.0001259

“Step 1” is the time measured right after the synthesis(...) call; “Run-time” is the same as in your code.
It looks like there is a small effect, but not much. At 3 and 4 cores the synthesis times also tend to vary much more between runs.
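For context, this is roughly how I compute those numbers. The synthesize function here is just a stand-in for the actual TTS call, and I'm assuming a 22050 Hz sample rate:

```python
import time

def benchmark(synthesize, text, sample_rate=22050):
    """Time a TTS call and compute the real-time factor.
    `synthesize` is a placeholder for the actual TTS function."""
    start = time.time()
    waveform = synthesize(text)      # "Step 1" is measured right after this call
    step1 = time.time() - start
    audio_seconds = len(waveform) / sample_rate
    rtf = step1 / audio_seconds      # > 1 means slower than real time
    return step1, rtf

# Example with a dummy synthesizer producing 1 s of silence:
step1, rtf = benchmark(lambda t: [0.0] * 22050, "Hello this is a test")
print(f"Step 1: {step1:.4f}s, real-time factor: {rtf:.3f}")
```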

Now something very strange I noticed. When I use the sentence
"Hello this is a longer test"
synthesis time gets very, very long :astonished: (2 cores):

| > Decoder stopped with 'max_decoder_steps

Step 1: 54.14
Run-time: 86.44
Real-time factor: 2.483
Time per step: 0.0001126

[EDIT] I just checked the audio and it's 3 MB (compared to 600 KB for the even longer sentence below). It starts perfectly normally, then comes a lot of silence, then a strange artifact that sounds like “lo loooo looooo” :rofl:, followed by even more silence ^^.

An even longer sentence like
"Hello this is a longer test to see if threading actually changes anything at all. Ok, let’s go."
looks ok again (2 cores):

Step 1: 12.58
Run-time: 19.22
Real-time factor: 2.732
Time per step: 0.0001239

Do you have any idea what's going on here? :slight_smile:

Regards,
Florian

Hey Florian

Welcome to our forum. Happy to carry on our conversation here.

The problem with the longer runtime is that the model fails to predict the stop token at the right time, so generation only ends when it hits the max_decoder_steps threshold. In general, if you don't use the right punctuation in a sentence, this is prone to happen. For instance, your sample is missing the full stop at the end. It can also happen with complex sentences. For your sample, please try it again with a full stop at the end.

I think we can try a couple more optimization tricks to improve the runtime speed, like exporting the model with TorchScript or using the TensorFlow backend. I am on vacation for a week; after that I can help with it.
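For reference, the TorchScript export looks roughly like this. TinyNet here is just a toy placeholder standing in for the real TTS model:

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Placeholder module; the real TTS model would go here."""
    def forward(self, x):
        return torch.relu(x) * 2.0

# Compile the module to TorchScript so it can run without the
# Python interpreter overhead (and be loaded from C++ if needed).
scripted = torch.jit.script(TinyNet())
scripted.save("tiny_net.pt")

# Reload and run the serialized module.
loaded = torch.jit.load("tiny_net.pt")
print(loaded(torch.tensor([-1.0, 2.0])))
```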

Best :slight_smile:

You are right, it works. Actually, I've seen similar errors in older TTS systems (I think it was MaryTTS), which is why my SEPIA code does a check that adds a “.” at the end if it's not there ^^.
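That check is tiny; it looks something like this (the function name is my own, not from the SEPIA code):

```python
def ensure_full_stop(text: str) -> str:
    """Append a "." if the sentence has no terminal punctuation,
    so the TTS decoder's stop prediction has a chance to fire."""
    text = text.strip()
    if text and text[-1] not in ".!?":
        text += "."
    return text

print(ensure_full_stop("Hello this is a longer test"))  # → "Hello this is a longer test."
```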

Looking forward to it :slight_smile: Enjoy your vacation.

cu

Just checked the models with TF, and TF gives almost a 10% speed boost. Soon I'll share a notebook running the TF models. It'd be nice if you could try these models too.