Partial caching of generated speech

This might be a bit “off the wall”, but I wondered if anyone had thoughts on how to approach an idea I had regarding TTS.

It’s more at the implementation end than the research end. And it might be somewhat out of scope, but I thought I’d ask.

The speed of TTS can be a bit of an issue in some cases, so I had looked before at caching the audio for complete output sentences. That’s fine (assuming one has fairly plentiful storage) and it doesn’t require significant effort with the TTS demo server to have it write out the wav files.
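For reference, the whole-sentence cache described above could look roughly like the sketch below. It is only an illustration: `SentenceCache` and `fake_synthesize` are made-up names, and `synthesize` stands in for whatever call actually produces the WAV bytes (it is not the TTS demo server's API).

```python
import hashlib
import tempfile
from pathlib import Path


class SentenceCache:
    """Cache synthesized audio on disk, keyed by a hash of the sentence text."""

    def __init__(self, cache_dir, synthesize):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.synthesize = synthesize  # hypothetical callable: text -> WAV bytes

    def _path_for(self, text):
        # Collapse whitespace so trivially different inputs share one entry.
        key = " ".join(text.split())
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.wav"

    def get(self, text):
        path = self._path_for(text)
        if path.exists():
            return path.read_bytes()   # cache hit: no model call needed
        audio = self.synthesize(text)  # cache miss: run the (slow) TTS model
        path.write_bytes(audio)
        return audio


# Stand-in for the real model, so the sketch runs on its own.
calls = []


def fake_synthesize(text):
    calls.append(text)
    return b"RIFF" + text.encode("utf-8")


cache = SentenceCache(tempfile.mkdtemp(), fake_synthesize)
first = cache.get("Your order has shipped.")
second = cache.get("Your order has shipped.")  # served from disk
```

The second `get` never reaches the model, which is the whole benefit, but it only helps when the exact sentence has been seen before.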

With highly predictable sentences (like those from many automated processes) this can work well. However, it isn’t very practical when you can’t predict precisely which sentences will be needed. Even something as basic as a sentence containing someone’s name would often fail (or require infeasible storage): you could make a sensible guess at the most common names, but unusual names are bound to come up eventually.

This got me wondering whether there might be a way to partially cache the generated speech, from the beginning up to the part that differs, and then have the model just continue from that point on. This could shave a chunk of time off sentences that have a common start but then feature a name or other hard-to-predict content.

Sentences of the kind that might benefit are, for instance: "hello [name]", since you’d only need to process the name part on demand, with the earlier part being pre-processed up front.
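The bookkeeping side of this is the easy part. As a minimal sketch (the function name and the example prefixes are made up for illustration), you could pick the longest pre-processed prefix that matches an incoming sentence and leave only the remainder to be generated on demand:

```python
def split_on_cached_prefix(sentence, cached_prefixes):
    """Return (prefix, remainder) using the longest cached prefix that matches."""
    best = ""
    for prefix in cached_prefixes:
        if sentence.startswith(prefix) and len(prefix) > len(best):
            best = prefix
    return best, sentence[len(best):]


# Hypothetical set of sentence starts that were pre-processed up front.
cached_prefixes = {"Hello ", "Hello and welcome back, ", "Your order number is "}

prefix, remainder = split_on_cached_prefix(
    "Hello and welcome back, Siobhan", cached_prefixes
)
no_prefix, whole = split_on_cached_prefix("Goodbye", cached_prefixes)
```

This only decides where to cut the text. It says nothing about how to make the audio for the remainder join up naturally with the cached part, which is the harder problem.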

What I’m struggling with is how you would get the model into the right state to continue. It can’t just start as if it were a normal new sentence, or the result will end up sounding like badly produced concatenative speech.

Q. Does anyone know of cases where people have a way of telling a model like this to continue from a particular point?

Q. And is there a name for that?

Unfortunately I wasn’t sure what this would be called. I did try Googling “caching neural networks” and other variants, but I didn’t spot anything that looked close to what I was trying to do.

Any ideas?

It’s an interesting idea. I think you can cache the network state if the part you want to cache is the beginning of the sentence, since the same text would always produce the same output. But after a short bit of tinkering I could not come up with an easy caching scheme that is faster than just computing the output.
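To make the state-caching idea concrete, here is a toy sketch. It is not a TTS model: `step` is an invented one-number recurrence standing in for a network, and it assumes a strictly left-to-right model whose state at the cut point depends only on the text before it. Models that look ahead over the whole sentence (for example with a bidirectional encoder) would not satisfy that assumption as-is.

```python
import math


def step(state, char):
    """One step of a toy recurrent 'model': new state from old state and input."""
    x = (ord(char) % 32) / 32.0
    new_state = math.tanh(0.7 * state + 0.5 * x - 0.1)
    output = round(new_state, 6)  # stand-in for one frame of audio
    return new_state, output


def run(text, state=0.0):
    """Run the model over text, returning the final state and all outputs."""
    outputs = []
    for char in text:
        state, out = step(state, char)
        outputs.append(out)
    return state, outputs


prefix, name = "hello ", "Siobhan"

# Done once, up front: keep both the state at the cut point and the outputs.
cached_state, cached_outputs = run(prefix)

# Done on demand: resume from the cached state and process only the name.
_, name_outputs = run(name, state=cached_state)
resumed = cached_outputs + name_outputs

# Reference: process the whole sentence from scratch.
_, full = run(prefix + name)
```

Here `resumed` matches `full` because the saved state carries everything the model knows about the prefix. Whether restoring that state is actually cheaper than recomputing it depends on how large the state is and how fast the model runs.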

Maybe one way to alleviate the bad artifacts would be to train the network with the caching scheme in place. That is, you pass in random sentences and keep the network’s states for the next sentences, so the network can learn how to carry on from them. I hope that all makes sense.