What do you see missing in the TTS project, and how do you think we can improve it?

Please share your comments and opinions about TTS here, and let us discuss together to enable a better open-source environment.

  • Better communication about the roadmap and what features are in active development.
  • Better information about different active branches - what’s being done and current status.
  • Backward compatibility with pre-trained models. Currently, we need to check out old git versions to use pre-trained weights. It may be hard to do with PyTorch.
  • Better integration with WaveRNN. Maybe include it as a subproject.
  • Some work on improving deployment, e.g. weight pruning and quantization (see the sketch after this list).
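
For the quantization point above, here is a minimal, hypothetical sketch of post-training dynamic quantization with stock PyTorch. `TinyDecoder` is only a stand-in for a trained TTS model, not a class from this repo, and the layer sizes are made up:

```python
# Minimal sketch: post-training dynamic quantization of Linear layers in PyTorch.
# TinyDecoder is a placeholder model; the real project would load its own checkpoint.
import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    """Stand-in with a couple of Linear layers, roughly like a Tacotron-style decoder."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(80, 256)
        self.fc2 = nn.Linear(256, 80)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = TinyDecoder().eval()

# Replace nn.Linear modules with int8 dynamically quantized versions.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized)  # Linear layers show up as dynamically quantized modules
```

Weight pruning would be a separate pass on top of this; the point is just that a deployment-oriented step like this can be bolted on after training without touching the model code.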
  • That is already being shared in the Projects tab on GitHub. But I guess we need to be clearer about it.

  • It is almost impossible with the current rate of experimentation and development. We should have a more stable code-base to promise this.

  • I am personally happy to keep it as is, at least for now. Having it fully included in TTS increases the complexity, which cannot be handled with the current head count of the project. If anyone wants to volunteer for this, that’s always welcome.

  • All of these are on the TODO list and, if you are interested, contributions are always welcome. However, before worrying about deployment, I’d like to fill any algorithmic gaps in TTS, since any change at the algorithmic level will directly affect the deployment phase.

It would be nice to have a place to discuss different experiments with improving the network architecture. If we could agree on a standard training/test set and on a standard vocoder, maybe we could compare results of experiments done by different people. Without a standard procedure, it’s not possible to tell whether an improvement is due to changes to the architecture or due to a better training set.

Results are very inconsistent between experiments. You might want to check performance against the TensorFlow variants; Keithito’s implementation is quite consistent across various datasets.

I am pro PyTorch, but it’s kind of hard to continue development here when results are this inconsistent. I know I haven’t given you enough information to work with; I just wanted to make sure you compare performance against the TensorFlow variants.

We can easily discuss experiments on GitHub issues. If you’d like to run a new one, you can just create a new issue and we go from there. I try to keep my experiments up to date on GitHub as explained. Let me know if that is not enough or if you have a better way.

For the vocoder, TTS’s default is Griffin-Lim (GL), so it is always easier to experiment with it. All the released models also use the GL vocoder, so anyone who wants to compare results can just use the latest release for the target architecture and compare it with their own version. I guess it is better to share TensorBoard logs as well, if you’d like to compare loss curves. Does that make sense?
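
For anyone unfamiliar with GL, here is a rough sketch of Griffin-Lim inversion using librosa. The STFT settings are illustrative only and are not the audio parameters from the TTS config; the “predicted” spectrogram is faked from a plain sine tone:

```python
# Rough sketch of Griffin-Lim (GL) inversion with librosa.
# Parameters below are illustrative, not the TTS project's actual settings.
import numpy as np
import librosa

sr, n_fft, hop_length = 22050, 1024, 256

# Fake "predicted" magnitude spectrogram, here just computed from a 440 Hz tone.
t = np.linspace(0, 1, sr, endpoint=False)
y = np.sin(2 * np.pi * 440 * t)
spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))

# GL iteratively estimates the missing phase and reconstructs a waveform.
wav = librosa.griffinlim(spec, n_iter=60, hop_length=hop_length)
```

Since the releases all use GL, comparing against them mostly means running this same kind of inversion on each model’s predicted spectrogram and listening to the results.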

All releases except one are trained on LJSpeech, so all benchmarking revolves around it.

Let me know if you have a better way to follow collaborative experimentation.

What do you mean by inconsistent results?

That sounds pretty good! But I think it’ll be better if we discuss experiments on Discourse and leave bug fixes for GitHub issues. The GL vocoder is perfectly fine; that’s what the TensorFlow implementations use as well.

I just don’t like to force people to have a Discourse account. We can tag the issues as ‘experiment’, then we can easily filter them.


By inconsistent I mean: when I use similar datasets to train TTS on the TensorFlow variant (I’ll link the repo soon), I don’t have to tune any hyperparameters at all there. That is, I tune it to work well on a standard dataset, and once I hit the sweet spot, I train another model on a different dataset with a similar composition, including quality (sample rate and a few other params), and that works out wonderfully. But on the Mozilla one, I find that once I tune the hyperparameters to the same standard dataset and then try to train a model on another dataset, it takes ages (1-2 days) to get past just producing noise, and then it starts sounding like the person’s voice but the content is just gibberish. I am not currently invested in experimenting with the TTS system, so I won’t be able to give you concrete information to go by. I’ll be back in a week or two and I’ll try to explain the situation with a lot more concrete information!

Oh, that’s cool too. But since I have one already, might as well build a community here! I did think having a Discourse account was a pain, but I’ve realized helping build a community is actually quite rewarding!

+1 on the WaveRNN integration