C++ Implementation of synthesizer for the Tacotron model based on OpenCV capable of running on mobile devices

ljackov · April 9, 2020, 12:59pm

We’ve managed to create a C++ implementation of the Tacotron multi-speaker embedding model based on OpenCV, which runs in near real time or faster on contemporary mobile devices. Training is done in Python using the original MozillaTTS implementation, and the trained model is then converted with a Python script into a data format that is easier to read from C++.

The data size of the converted Tacotron model is approx. 28MB of binary data;
in our latest pre-trained model we have two voices simultaneously (male and female).
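The actual converted format is not described in this thread, so the following is only a minimal sketch of what an "easy to read from C++" weight file could look like. The record layout (name length, name, rank, dimensions, raw float32 values, all in host byte order) and the names `Tensor`, `read_tensor` and `parse_tensors` are assumptions made for illustration, not the authors' format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

static_assert(sizeof(float) == 4, "sketch assumes 32-bit floats");

// Hypothetical record: one named tensor with its shape and flat data.
struct Tensor {
    std::string name;
    std::vector<std::uint32_t> shape;
    std::vector<float> data;
};

// Reads one 32-bit unsigned integer in host byte order.
static bool read_u32(std::istream& in, std::uint32_t& v) {
    return static_cast<bool>(in.read(reinterpret_cast<char*>(&v), sizeof v));
}

// Reads one record. Returns false at end of stream or on truncated input.
// A production reader would also bound name_len and the element count.
bool read_tensor(std::istream& in, Tensor& t) {
    std::uint32_t name_len = 0, ndims = 0;
    if (!read_u32(in, name_len)) return false;
    t.name.assign(name_len, '\0');
    if (name_len != 0 && !in.read(&t.name[0], name_len)) return false;
    if (!read_u32(in, ndims)) return false;
    t.shape.assign(ndims, 0);
    std::size_t count = 1;
    for (std::uint32_t& d : t.shape) {
        if (!read_u32(in, d)) return false;
        count *= d;
    }
    t.data.assign(count, 0.0f);
    if (count == 0) return true;
    return static_cast<bool>(in.read(
        reinterpret_cast<char*>(t.data.data()),
        static_cast<std::streamsize>(count * sizeof(float))));
}

// Parses every complete record found in an in-memory byte buffer.
std::vector<Tensor> parse_tensors(const std::string& bytes) {
    std::istringstream in(bytes, std::ios::in | std::ios::binary);
    std::vector<Tensor> tensors;
    Tensor t;
    while (read_tensor(in, t)) tensors.push_back(t);
    return tensors;
}
```

With a layout like this, the Python side only has to dump each weight array as a small header plus raw bytes, and the C++ side needs no parsing library at all.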

Our implementation has been tested and runs on Android and iOS.

erogol · April 9, 2020, 10:50am

Amazing, honestly. At what level do you use OpenCV? For matrix operations?

Two questions:

1. Do you have any samples to share for the speech quality?
2. Do you have any plans to contribute this to the TTS repo? We are also working, with a contributor friend, on a mobile-capable vocoder model based on TensorFlow. We could merge the two.

ljackov · April 9, 2020, 5:16pm

We use OpenCV’s dnn library; however, we had to code several layers that were not implemented, the most notable being GRUCell, as well as the Tacotron-related layers (normally OpenCV dnn doesn’t support iterative models).
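For readers unfamiliar with the layer, here is a minimal sketch of a single GRU cell step in plain C++. It follows the gate equations and the r, z, n weight ordering of PyTorch's `torch.nn.GRUCell`, which is an assumption based on the models being trained with MozillaTTS. It does not use OpenCV and is not the code discussed in this thread; `GruCellParams`, `affine` and `gru_cell_step` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical container for one GRU cell's parameters, laid out like
// PyTorch's GRUCell: rows are stacked in the order r, z, n.
struct GruCellParams {
    std::size_t input_size;
    std::size_t hidden_size;
    std::vector<float> w_ih;  // (3 * hidden_size) x input_size, row-major
    std::vector<float> w_hh;  // (3 * hidden_size) x hidden_size, row-major
    std::vector<float> b_ih;  // 3 * hidden_size
    std::vector<float> b_hh;  // 3 * hidden_size
};

// y = W * v + b for a row-major matrix W with `rows` rows.
static std::vector<float> affine(const std::vector<float>& w,
                                 const std::vector<float>& b,
                                 const std::vector<float>& v,
                                 std::size_t rows) {
    const std::size_t cols = v.size();
    assert(w.size() == rows * cols && b.size() == rows);
    std::vector<float> y(b);
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            y[r] += w[r * cols + c] * v[c];
    return y;
}

static float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// One time step: returns the new hidden state given input x and state h.
std::vector<float> gru_cell_step(const GruCellParams& p,
                                 const std::vector<float>& x,
                                 const std::vector<float>& h) {
    assert(x.size() == p.input_size && h.size() == p.hidden_size);
    const std::size_t H = p.hidden_size;
    const std::vector<float> gi = affine(p.w_ih, p.b_ih, x, 3 * H);
    const std::vector<float> gh = affine(p.w_hh, p.b_hh, h, 3 * H);

    std::vector<float> h_new(H);
    for (std::size_t i = 0; i < H; ++i) {
        const float r = sigmoid(gi[i] + gh[i]);          // reset gate
        const float z = sigmoid(gi[H + i] + gh[H + i]);  // update gate
        // Candidate state: the reset gate scales only the hidden-side term.
        const float n = std::tanh(gi[2 * H + i] + r * gh[2 * H + i]);
        h_new[i] = (1.0f - z) * n + z * h[i];
    }
    return h_new;
}
```

An iterative decoder would call `gru_cell_step` once per output frame, feeding each returned state back in as `h`. That feedback loop is what a purely feed-forward graph cannot express directly.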

Our current implementation is for Bulgarian; we have a sample (unfortunately not of our latest voice) at http://tts.skycode.com . You have to enter text in Bulgarian, but you can copy the page title for a voice sample. However, any language can be synthesized if there is a trained model for it.

The quality is virtually the same as that of the Python implementation;
currently we’re badly struck by the coronavirus crisis, so we hope to have some sort of financial returns from sharing our expertise ;(

erogol · April 9, 2020, 12:35pm

that sounds really good. Is this only the TTS model or with a vocoder?

PS. Good luck with the virus and its financial boom.

PS2. Putting your code into TTS does not promise any financial return, of course, but contributing to a well-known repo could be useful to promote your name and work.

ljackov · April 9, 2020, 12:57pm

Have a look at the link – it outputs a .wav file. We currently use the standard Griffin-Lim vocoder algo. The implementation behind the web interface is a standalone Linux binary compiled from our code.

We will consider adding the code to the repo.

erogol · April 9, 2020, 1:43pm

Yeah, I tried the site already. The result sounds really good even though it is just GL. Good work!

carlfm01 · April 9, 2020, 4:53pm

I confirm this.

A while ago Mozilla recognized my work with mentions, even on Mozilla Hack. People keep contacting me about .NET collaborations for STT; there are so many that I CAN’T accept them all.

This is an area of active development for a lot of companies and people, so there’s a high chance of getting a lot of projects with a bit of visibility.

kms · April 24, 2020, 11:41pm

1. Have you posted the code somewhere?
2. Have you looked at the FastSpeech-based implementations as well?
3. The link you posted seems to be giving authorization errors.

ljackov · April 27, 2020, 9:04am

Hello,

1. No, the code hasn’t been published anywhere yet;
2. No, we haven’t looked at the FastSpeech-based implementations;
3. The link is plain http, which is something some browsers moan about.

The code is compatible with the models trained by the Tacotron implementation of MozillaTTS.

kms · May 5, 2020, 2:54am

That’s very interesting. Are you planning to release it with an open source license?

ljackov · May 5, 2020, 7:08am

We haven’t decided yet.

Oymate · August 31, 2020, 10:58am

Any apk file anywhere?

ljackov · September 23, 2020, 8:57am

The only app it’s currently implemented in is Wildmaps ( https://play.google.com/store/apps/details?id=com.wildmaps )

Please note that only the Bulgarian voice uses it.