We’ve created a C++ implementation of the Tacotron multi-speaker embedding model, built on OpenCV, that runs in near real time or faster on contemporary mobile devices. Training is done in Python with the original Mozilla TTS implementation, and a Python script then converts the trained model data into a format that is easier to read from C++.
- The converted Tacotron model is approximately 28 MB of binary data.
- Our latest pre-trained model contains two voices (male and female) in a single model.
Our implementation has been tested and runs on Android and iOS.
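
As an illustration of the conversion step described above, here is a minimal sketch of how trained weights might be dumped into a flat binary file that is simple to parse from C++. The record layout, the `TTSW` tag, and the names `write_tensors` and `read_tensors` are assumptions made for this example; they are not the project's actual converter or file format. The sketch uses only the Python standard library and takes weights as plain lists of floats.

```python
import io
import struct

MAGIC = b"TTSW"  # hypothetical file tag, not the project's actual format


def _element_count(shape):
    """Return the number of elements implied by a shape tuple."""
    count = 1
    for dim in shape:
        count *= dim
    return count


def write_tensors(stream, tensors):
    """Write {name: (shape, flat_values)} as little-endian float32 records."""
    stream.write(MAGIC)
    stream.write(struct.pack("<I", len(tensors)))
    for name, (shape, values) in tensors.items():
        count = _element_count(shape)
        if count != len(values):
            raise ValueError(f"{name}: shape {shape} needs {count} values")
        encoded = name.encode("utf-8")
        stream.write(struct.pack("<I", len(encoded)))          # name length
        stream.write(encoded)                                  # name bytes
        stream.write(struct.pack("<I", len(shape)))            # number of dims
        stream.write(struct.pack(f"<{len(shape)}I", *shape))   # dims
        stream.write(struct.pack(f"<{count}f", *values))       # float32 data


def read_tensors(stream):
    """Read the records back; used here only to check the round trip."""
    if stream.read(4) != MAGIC:
        raise ValueError("unexpected file tag")
    (num_tensors,) = struct.unpack("<I", stream.read(4))
    tensors = {}
    for _ in range(num_tensors):
        (name_len,) = struct.unpack("<I", stream.read(4))
        name = stream.read(name_len).decode("utf-8")
        (ndim,) = struct.unpack("<I", stream.read(4))
        shape = struct.unpack(f"<{ndim}I", stream.read(4 * ndim))
        count = _element_count(shape)
        values = list(struct.unpack(f"<{count}f", stream.read(4 * count)))
        tensors[name] = (shape, values)
    return tensors
```

With a fixed byte order and an explicit shape header, the C++ side could read each record sequentially and, for a 2-D record, wrap the float data in a `cv::Mat` of type `CV_32F`. In practice the weights would first be flattened from the trained model's parameters; that step is omitted here.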