Extract timing of phonemes and words from attention map

Hey all!

Thankfully I have been able to get the pre-trained model up and running, and it's producing great synthesized speech.

Some context: I want to animate a face / mouth to speak while the synthesized audio is playing. In order to do this I need the start and stop time of each phoneme in the synthesized speech.

I am wondering if it is possible to use the attention map to extract the timings of the synthesized words? Once I have this I would like to extract the timings of each phoneme…

I would like to analyze the attention map to do this. I know I could compute the alignments with a separate acoustic model, but that feels like overkill, and I thought it would be better to find a solution that's already in the TTS library.

I originally posted on GitHub, and erogol suggested looking at the attention maps. I'm also wondering if there is a way to get the image / data structure that contains the attention map of a synthesized phrase, and analyze it to get the proper timings.

Thanks for any help! :smile:

It is not an easy problem. You can get some insights from this paper: "Phonemic-level Duration Control Using Attention Alignment for Natural Speech Synthesis".

@joshua.eisenberg - did you ever get anywhere with this?

It’s an interesting thing to look into and I was thinking about this earlier.

If you have a look at this notebook you'll be able to explore the alignment charts. If I'm interpreting them right, one dimension of "alignment" (the np array) is the phonemes and the other is the timesteps in the audio output.
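
If it helps, here's a minimal sketch for poking at that array outside the notebook. The shape convention (decoder timesteps × phonemes) is my assumption, and the diagonal dummy array is just a stand-in so the snippet runs on its own; swap in the real alignment from the notebook:

```python
import numpy as np
import matplotlib.pyplot as plt

# "alignment" is the np array the notebook plots; I'm assuming shape
# (n_decoder_steps, n_phonemes). A fake diagonal one stands in here so
# the snippet runs standalone; use the real array from the notebook.
n_steps, n_phonemes = 200, 40
alignment = np.eye(n_phonemes)[np.linspace(0, n_phonemes - 1, n_steps).astype(int)]

plt.imshow(alignment.T, aspect="auto", origin="lower", interpolation="none")
plt.xlabel("decoder timestep (audio frame)")
plt.ylabel("phoneme index")
plt.show()

# Which phoneme each decoder frame attends to most strongly:
print(alignment.argmax(axis=1)[:20])
```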

This isn't (yet) quite what you seek, but if you could match the word starts and ends in the text to their corresponding phonemes, you could turn that into approximate timings based on the attention line. For instance, if you know the second word starts at phoneme symbol eight, you read across from eight to the attention line and see which point on the time axis that corresponds to; that should be approximately the start time of the second word.
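
A rough sketch of that read-across, assuming the (decoder timesteps × phonemes) shape from above. The hop length, sample rate and reduction factor defaults are placeholders for whatever your audio config and model actually use, and `word_start_time` is just a name I made up:

```python
import numpy as np

def word_start_time(alignment, first_phoneme_idx,
                    hop_length=256, sample_rate=22050, r=1):
    """Approximate start time (seconds) of the word whose first phoneme
    sits at first_phoneme_idx. alignment: (n_decoder_steps, n_phonemes).
    hop_length / sample_rate come from your audio config; r is the
    model's reduction factor (frames emitted per decoder step)."""
    # Phoneme most attended to at each decoder step.
    attended = alignment.argmax(axis=1)
    # First decoder step whose attention has reached that phoneme.
    steps = np.where(attended >= first_phoneme_idx)[0]
    if len(steps) == 0:
        return None  # attention never got there (e.g. truncated output)
    # Each decoder step covers r * hop_length audio samples.
    return steps[0] * r * hop_length / sample_rate

# e.g. if the second word starts at phoneme symbol eight:
# print(word_start_time(alignment, 8))
```

Call it with the real alignment array and the values from your config; the result is only as sharp as the attention itself, which can smear across a few frames.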

And if you'd like to do this for real data, you need to run the model in teacher-forcing mode (providing the real spectrograms to the prenet, as in a training pass) so the attention alignment lines up with the real audio.
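
Loosely sketched, that pass might look like the following; `model`, `text_ids`, `text_lengths` and `mel_specs` are assumed to be a loaded Tacotron-style model and a prepared batch, and the forward signature / return order is a guess you should check against your model class:

```python
import torch

# Hypothetical teacher-forced pass: condition the decoder on the real
# mel spectrogram, as in training. The forward signature below is an
# assumption based on the Tacotron-style models in this repo; check
# your model class for the exact arguments and return order.
model.eval()
with torch.no_grad():
    decoder_out, postnet_out, alignments, stop_tokens = model(
        text_ids, text_lengths, mel_specs)

# The attention now aligns the input phonemes with the *real* audio's
# frames, so the same frame-to-seconds conversion as above applies.
alignment = alignments[0].cpu().numpy()
```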
