For a university course project, a few of us explored different TTS techniques for generating emotional speech (both HMM-based and Deep Learning-based). All our experiments are here - Emotional Text-to-speech · GitHub
We’ve also released the models for all the approaches we tried (even the ones that didn’t work), along with their corresponding code for reproducibility and some demos that can be played with!
Thanks! No, we have not done any analysis of inference time. Could you point us to some resources/publications where such analysis has been done? It would be an interesting aspect to study, but we are not sure how to set up a rigorous framework for conducting it.
Hi @brihi, thank you for sharing this. It looks like excellent work, and I especially admire that you list what did not work as well. There’s often a lot of focus on the successful approaches, and yet you can learn as much (sometimes more) from what didn’t work out.
One idea that sprang to mind (and it might need some more sense checking, so take it with a pinch of salt!) would be to train a classifier for emotional audio and then see whether it could pick out particular emotional samples from the larger unannotated datasets, using that approach to bootstrap a dataset. I suspect that LJ Speech may not have a full range of emotions (given the subjects of the books behind it), but something like LibriTTS might have more range in that respect. Using the speaker encoder here (https://github.com/mozilla/TTS/tree/master/speaker_encoder) to cluster the distinct emotions might be a way to explore whether the emotion can be picked out. A rough sketch of what I mean is below.
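To make that a bit more concrete, here's a minimal sketch of the clustering step (again, take it with a pinch of salt). It assumes a placeholder function `embed_utterance` that wraps whatever encoder you end up using (e.g. the speaker encoder linked above); the wav directory, cluster count, and function names are just illustrative, and the clustering itself is plain scikit-learn KMeans:

```python
# Sketch: embed each utterance with a speaker/style encoder, cluster the
# embeddings, then listen to a few samples per cluster to see whether any
# cluster lines up with an emotional speaking style worth hand-labelling.
from pathlib import Path

import numpy as np
from sklearn.cluster import KMeans


def embed_utterance(wav_path: Path) -> np.ndarray:
    """Hypothetical placeholder: return a fixed-size embedding for one wav.

    In practice this would wrap the encoder's inference code (load audio,
    compute features, run the model) and return its embedding vector.
    """
    raise NotImplementedError


def cluster_utterances(wav_dir: str, n_clusters: int = 5):
    wav_paths = sorted(Path(wav_dir).glob("*.wav"))
    embeddings = np.stack([embed_utterance(p) for p in wav_paths])

    # Cluster the embedding space; the hope is that some clusters capture
    # expressive/emotional styles that can be used to bootstrap a dataset.
    kmeans = KMeans(n_clusters=n_clusters, random_state=0, n_init=10)
    labels = kmeans.fit_predict(embeddings)

    clusters = {i: [] for i in range(n_clusters)}
    for path, label in zip(wav_paths, labels):
        clusters[label].append(path)
    return clusters
```

Whether the speaker encoder's embeddings actually separate emotion (rather than just speaker identity) is exactly the open question, so I'd treat this purely as an exploratory tool.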
That idea actually sounds very good! This was our first speech-related project, so it was a little difficult to understand how and where exactly we could find clean, well-structured data. This seems like an interesting approach. Thanks again for linking the resources, we’ll look into it and get back!
Also, I found this mixed dataset, which is a mixture of audio, audio-visual, and visual data tagged with emotions, so it may have some application if you focus purely on the audio parts. I’m AFK right now, so I can’t work out the duration of the audio.
Thanks a lot! We had come across it, but the problem we were facing was that we require single-speaker data for each emotion. EmoV-DB has only around 300 utterances per speaker per emotion, which was still not sufficient. We’re still looking for larger-scale single-speaker datasets.