For a university course project, a few of us explored different TTS techniques for generating emotional speech (both HMM-based and Deep Learning-based). All our experiments are here - Emotional Text-to-speech · GitHub
We’ve also released the models for all the approaches we tried (even the ones that didn’t work), along with their corresponding code for reproducibility and some demos that can be played with!
Thanks! No, we have not done any analysis of inference time. Could you point us to some resources/publications where such analysis has been done? It would be an interesting aspect to study, but we are not sure how to set up a rigorous framework for conducting it.
Hi @brihi, thank you for sharing this. It looks like excellent work, and I especially admire that you list what did not work as well. There’s often a lot of focus on the successful approaches, and yet you can learn as much (sometimes more) from what didn’t work out.
One idea that sprang to mind (and it might need some more sense checking, so take it with a pinch of salt!) would be to train a classifier for emotional audio and then see whether it could pick out particular emotional samples from the larger unannotated datasets, using that approach to bootstrap a dataset. I suspect that LJ Speech may not have a full range of emotions (given the subjects of the books behind it), but something like LibriTTS might have more range in that respect. Using the speaker encoder here (https://github.com/mozilla/TTS/tree/master/speaker_encoder) to cluster the distinct emotions might be a way to explore whether the emotion can be picked out. A rough sketch of what I mean is below.
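To make that a bit more concrete, here's a minimal sketch of the clustering step (again, take it with a pinch of salt). It assumes a placeholder function `embed_utterance` that wraps whatever encoder you end up using (e.g. the speaker encoder linked above); the wav directory, cluster count, and function names are just illustrative, and the clustering itself is plain scikit-learn KMeans:

```python
# Sketch: embed each utterance with a speaker/style encoder, cluster the
# embeddings, then listen to a few samples per cluster to see whether any
# cluster lines up with an emotional speaking style worth hand-labelling.
from pathlib import Path

import numpy as np
from sklearn.cluster import KMeans


def embed_utterance(wav_path: Path) -> np.ndarray:
    """Hypothetical placeholder: return a fixed-size embedding for one wav.

    In practice this would wrap the encoder's inference code (load audio,
    compute features, run the model) and return its embedding vector.
    """
    raise NotImplementedError


def cluster_utterances(wav_dir: str, n_clusters: int = 5):
    wav_paths = sorted(Path(wav_dir).glob("*.wav"))
    embeddings = np.stack([embed_utterance(p) for p in wav_paths])

    # Cluster the embedding space; the hope is that some clusters capture
    # expressive/emotional styles that can be used to bootstrap a dataset.
    kmeans = KMeans(n_clusters=n_clusters, random_state=0, n_init=10)
    labels = kmeans.fit_predict(embeddings)

    clusters = {i: [] for i in range(n_clusters)}
    for path, label in zip(wav_paths, labels):
        clusters[label].append(path)
    return clusters
```

Whether the speaker encoder's embeddings actually separate emotion (rather than just speaker identity) is exactly the open question, so I'd treat this purely as an exploratory tool.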
That idea actually sounds very good! This was our first speech-related project, so it was a little difficult to understand how and where exactly we could find clean, well-structured data. This seems like an interesting approach. Thanks again for linking the resources, we’ll look into it and get back!
Also, I found this mixed dataset, which is a mixture of audio, audio-visual, and visual data tagged with emotions, so it may have some application if you focus purely on the audio parts. I’m AFK right now, so I can’t work out the duration of the audio.
Thanks a lot! We had come across it, but the problem we were facing was that we require single-speaker data for each emotion. EmoV-DB has only around 300 utterances per speaker per emotion, which was still not sufficient. We’re still looking for larger-scale single-speaker datasets.