C++ Implementation of synthesizer for the Tacotron model based on OpenCV capable of running on mobile devices

We’ve managed to create a C++ implementation of the Tacotron multi-speaker embedding model based on OpenCV which runs near real-time or faster on contemporary mobile devices. The training is done in Python using the original MozillaTTS implementation, and the trained model data is then converted with a Python script into a format that is easier to read from C++.

  • The data size of the converted Tacotron model is approx. 28MB of binary data;
  • in our latest pre-trained model we have two voices simultaneously (male and female).
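
As a rough illustration of what an "easier to read from C++" weight format could look like, here is a minimal sketch. The record layout, the `Tensor` struct, and the `writeTensor`/`readTensor` helpers are all hypothetical and are not the format used by the implementation described above; the sketch also assumes the file and the host share the same byte order (little-endian on typical Android and iOS devices).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>    // std::stringstream, handy for in-memory round trips
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical record layout (NOT the authors' format), host byte order:
//   uint32 nameLen | name bytes | uint32 rank | uint32 dims[rank] | float32 data[]
struct Tensor {
    std::string name;
    std::vector<std::uint32_t> shape;
    std::vector<float> data;
};

static void writeU32(std::ostream& out, std::uint32_t v) {
    out.write(reinterpret_cast<const char*>(&v), sizeof v);
}

static std::uint32_t readU32(std::istream& in) {
    std::uint32_t v = 0;
    if (!in.read(reinterpret_cast<char*>(&v), sizeof v))
        throw std::runtime_error("unexpected end of weight data");
    return v;
}

// Serialize one tensor record; the element count must match the shape.
void writeTensor(std::ostream& out, const Tensor& t) {
    std::size_t count = 1;
    for (std::uint32_t d : t.shape) count *= d;
    assert(count == t.data.size());
    writeU32(out, static_cast<std::uint32_t>(t.name.size()));
    out.write(t.name.data(), static_cast<std::streamsize>(t.name.size()));
    writeU32(out, static_cast<std::uint32_t>(t.shape.size()));
    for (std::uint32_t d : t.shape) writeU32(out, d);
    out.write(reinterpret_cast<const char*>(t.data.data()),
              static_cast<std::streamsize>(t.data.size() * sizeof(float)));
}

// Parse one tensor record; throws if the stream ends early.
Tensor readTensor(std::istream& in) {
    Tensor t;
    t.name.resize(readU32(in));
    if (!t.name.empty() &&
        !in.read(&t.name[0], static_cast<std::streamsize>(t.name.size())))
        throw std::runtime_error("truncated tensor name");
    t.shape.resize(readU32(in));
    std::size_t count = 1;
    for (std::uint32_t& d : t.shape) {
        d = readU32(in);
        count *= d;
    }
    t.data.resize(count);
    if (count > 0 &&
        !in.read(reinterpret_cast<char*>(t.data.data()),
                 static_cast<std::streamsize>(count * sizeof(float))))
        throw std::runtime_error("truncated tensor data");
    return t;
}
```

A flat layout like this can be loaded with a handful of `read` calls and no parsing library, which is one plausible reason to convert a Python checkpoint before shipping it to a mobile app.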

Our implementation has been tested and runs on Android and iOS.


Amazing, honestly. At what level do you use OpenCV? For matrix operations?

Two questions:

  • do you have any samples to share for the speech quality?
  • do you have any plans to contribute this to the TTS repo? We are also working with a contributor friend on a mobile-capable vocoder model based on TensorFlow. We could merge these two.

We use OpenCV’s dnn library; however, we had to code several layers that were not implemented, the most notable being GRUCell, as well as the Tacotron-related layers (normally OpenCV dnn doesn’t support iterative models).
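
For readers wondering what a hand-written GRUCell involves, here is a minimal single-step sketch in plain C++ with no OpenCV dependency. It is not the code discussed in this thread: the `GruWeights` struct and the `gruStep` function are hypothetical names, and the gate equations assume the PyTorch `GRUCell` convention, since the models are trained with a PyTorch-based toolkit.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// Hypothetical container for the weights of one GRU cell (row-major matrices).
// The names are illustrative and do not come from the authors' code.
struct GruWeights {
    std::size_t inputSize;
    std::size_t hiddenSize;
    Vec W_ir, W_iz, W_in;  // input-to-hidden matrices, hiddenSize x inputSize
    Vec W_hr, W_hz, W_hn;  // hidden-to-hidden matrices, hiddenSize x hiddenSize
    Vec b_ir, b_iz, b_in;  // input biases, hiddenSize each
    Vec b_hr, b_hz, b_hn;  // hidden biases, hiddenSize each
};

// Computes y = W * x + b for a row-major matrix W with `rows` rows.
static Vec affine(const Vec& W, const Vec& b, const Vec& x, std::size_t rows) {
    const std::size_t cols = x.size();
    assert(W.size() == rows * cols && b.size() == rows);
    Vec y(b);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            y[r] += W[r * cols + c] * x[c];
    return y;
}

static float sigmoid(float v) { return 1.0f / (1.0f + std::exp(-v)); }

// One time step, assuming the PyTorch GRUCell gate convention:
//   r  = sigmoid(W_ir x + b_ir + W_hr h + b_hr)
//   z  = sigmoid(W_iz x + b_iz + W_hz h + b_hz)
//   n  = tanh(W_in x + b_in + r * (W_hn h + b_hn))
//   h' = (1 - z) * n + z * h
Vec gruStep(const GruWeights& w, const Vec& x, const Vec& h) {
    assert(x.size() == w.inputSize && h.size() == w.hiddenSize);
    const std::size_t H = w.hiddenSize;
    const Vec ir = affine(w.W_ir, w.b_ir, x, H);
    const Vec iz = affine(w.W_iz, w.b_iz, x, H);
    const Vec in = affine(w.W_in, w.b_in, x, H);
    const Vec hr = affine(w.W_hr, w.b_hr, h, H);
    const Vec hz = affine(w.W_hz, w.b_hz, h, H);
    const Vec hn = affine(w.W_hn, w.b_hn, h, H);
    Vec next(H);
    for (std::size_t i = 0; i < H; ++i) {
        const float r = sigmoid(ir[i] + hr[i]);
        const float z = sigmoid(iz[i] + hz[i]);
        const float n = std::tanh(in[i] + r * hn[i]);
        next[i] = (1.0f - z) * n + z * h[i];
    }
    return next;
}
```

An iterative decoder would call a step function like this once per output frame, feeding the returned hidden state back in. That kind of loop is what a static feed-forward graph cannot express on its own.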

Our current implementation is for Bulgarian; we have a sample at http://tts.skycode.com (unfortunately not of our latest voice). You have to enter text in Bulgarian, but you can copy the page title for a voice sample. However, any language can be synthesized if there is a trained model for it.

  • The quality is virtually the same as that of the Python implementation;
  • currently we’ve been hit hard by the coronavirus crisis, so we hope to have some sort of financial return from sharing our expertise ;(

That sounds really good. Is this only the TTS model, or does it include a vocoder?

PS. Good luck with the virus and its financial impact.

PS2. Putting your code into TTS does not promise any financial return, of course, but contributing to a well-known repo could be useful to promote your name and work.


Have a look at the link: it outputs a .wav file. We currently use the standard Griffin-Lim vocoder algorithm. The implementation behind the web interface is a standalone Linux binary compiled from our code.
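
For context, Griffin-Lim recovers a waveform from magnitude-only spectrogram frames by alternating between the time domain and the STFT domain while keeping the known magnitudes. The sketch below is a generic illustration and not the code behind the site: the function names are hypothetical, it uses a naive O(N²) DFT instead of an FFT to stay dependency-free, and it starts from zero phase for determinism (random phase is the more common initialization).

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using Cplx = std::complex<double>;
using Frame = std::vector<Cplx>;  // one full N-point spectrum
using Signal = std::vector<double>;
static const double kPi = 3.14159265358979323846;

// Periodic Hann window of length n.
static std::vector<double> hann(std::size_t n) {
    std::vector<double> w(n);
    for (std::size_t i = 0; i < n; ++i)
        w[i] = 0.5 - 0.5 * std::cos(2.0 * kPi * static_cast<double>(i) / n);
    return w;
}

// Naive DFT; sign = -1 is the forward transform, +1 the unscaled inverse.
static Frame dft(const Frame& in, int sign) {
    const std::size_t N = in.size();
    Frame out(N);
    for (std::size_t k = 0; k < N; ++k)
        for (std::size_t n = 0; n < N; ++n)
            out[k] += in[n] * std::polar(1.0, sign * 2.0 * kPi * static_cast<double>(k * n) / N);
    return out;
}

// Windowed short-time Fourier transform with frame length N and hop size hop.
static std::vector<Frame> stft(const Signal& x, std::size_t N, std::size_t hop) {
    const std::vector<double> w = hann(N);
    std::vector<Frame> frames;
    for (std::size_t start = 0; start + N <= x.size(); start += hop) {
        Frame f(N);
        for (std::size_t i = 0; i < N; ++i) f[i] = x[start + i] * w[i];
        frames.push_back(dft(f, -1));
    }
    return frames;
}

// Least-squares inverse STFT: windowed overlap-add divided by the summed squared window.
static Signal istft(const std::vector<Frame>& S, std::size_t N, std::size_t hop) {
    assert(!S.empty());
    const std::vector<double> w = hann(N);
    const std::size_t len = (S.size() - 1) * hop + N;
    Signal x(len, 0.0), norm(len, 0.0);
    for (std::size_t t = 0; t < S.size(); ++t) {
        const Frame f = dft(S[t], +1);
        for (std::size_t i = 0; i < N; ++i) {
            x[t * hop + i] += f[i].real() / static_cast<double>(N) * w[i];
            norm[t * hop + i] += w[i] * w[i];
        }
    }
    for (std::size_t i = 0; i < len; ++i)
        if (norm[i] > 1e-8) x[i] /= norm[i];
    return x;
}

// Griffin-Lim: keep the target magnitudes, re-estimate the phase each iteration.
Signal griffinLim(const std::vector<std::vector<double>>& mag,
                  std::size_t N, std::size_t hop, int iterations) {
    std::vector<Frame> S(mag.size(), Frame(N));
    for (std::size_t t = 0; t < mag.size(); ++t)
        for (std::size_t k = 0; k < N; ++k) S[t][k] = mag[t][k];  // zero phase
    Signal x = istft(S, N, hop);
    for (int it = 0; it < iterations; ++it) {
        const std::vector<Frame> est = stft(x, N, hop);
        for (std::size_t t = 0; t < mag.size(); ++t)
            for (std::size_t k = 0; k < N; ++k) {
                const double a = std::abs(est[t][k]);
                S[t][k] = a > 1e-12 ? mag[t][k] * est[t][k] / a : Cplx(mag[t][k], 0.0);
            }
        x = istft(S, N, hop);
    }
    return x;
}
```

In a Tacotron pipeline the magnitudes would come from the model's predicted spectrogram (after converting it back to a linear-frequency scale), and a real implementation would swap in an FFT for speed.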

We will consider adding the code to the repo.

Yeah, I tried the site already. The result sounds really good even though it is just GL. Good work!

I confirm this.

A while ago Mozilla recognized my work with mentions, even on Mozilla Hacks, and people keep contacting me about .NET collaborations for STT; there are so many that I can’t accept them all.

This is an active area of development for a lot of companies and people, so there’s a high chance of getting a lot of projects with a bit of visibility.

  1. Have you posted the code somewhere?
  2. Have you looked at the FastSpeech-based implementations, as well?
  3. The link you posted seems to be giving authorization errors.

Hello,

  1. No, the code hasn’t been published anywhere yet;
  2. No;
  3. The link is plain http, which is something some browsers moan about.

The code is compatible with the models trained by the Tacotron implementation of MozillaTTS.

That’s very interesting. Are you planning to release it with an open source license?

We haven’t decided yet.