Thankfully I have been able to get the pre-trained model up and running, and producing great synthesized speech.
Some context: I want to animate a face / mouth that speaks while the synthesized audio is playing. To do this I need the start and stop times of each phoneme in the synthesized speech.
I am wondering if it is possible to use the attention map to extract the timings of the synthesized words. Once I have those, I would like to extract the timings of each phoneme…
I would like to analyze the attention map to do this. I know I could use an acoustic model to calculate the alignments instead, but that seems like overkill, and I thought it would be better to find a solution using what the TTS library already produces.
I originally posted on the GitHub repo, and erogol suggested looking at the attention maps. So I'm wondering if there is a way to get the image / data structure that contains the attention map for a synthesized phrase, and then analyze it to get the timings.
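To make the question concrete, here is roughly what I have in mind, assuming the attention map comes back as a (decoder_steps × input_tokens) NumPy array — the exact shape, and the frame duration (hop_length / sample_rate), are assumptions on my part, not something I've confirmed in the library:

```python
import numpy as np

def attention_to_timings(alignment, frame_duration):
    """Turn a (decoder_steps x input_tokens) attention map into a list of
    (token_index, start_time_s, stop_time_s) tuples.

    frame_duration is assumed to be hop_length / sample_rate of the
    model's mel frames (e.g. 256 / 22050 ~ 0.0116 s) -- check your config.
    """
    # For each decoder frame, pick the input token with the highest weight.
    token_per_frame = alignment.argmax(axis=1)

    timings = []
    start = 0
    for i in range(1, len(token_per_frame) + 1):
        # Close a segment when the attended token changes (or at the end).
        if i == len(token_per_frame) or token_per_frame[i] != token_per_frame[start]:
            timings.append((int(token_per_frame[start]),
                            start * frame_duration,
                            i * frame_duration))
            start = i
    return timings

# Toy alignment: 6 decoder frames attending to 3 tokens in order.
toy = np.array([
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.9, 0.1],
    [0.0, 0.2, 0.8],
    [0.1, 0.1, 0.8],
])
print(attention_to_timings(toy, frame_duration=0.01))
```

If the attention is monotonic enough, each run of frames attending to the same token would give that token's duration, and summing durations over the tokens of a word would give word timings. Does this match how the alignment data is actually exposed?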
Thanks for any help!