This might be a bit “off the wall”, but I wondered if anyone had thoughts on how to approach an idea I had regarding TTS.
It’s more at the implementation end than the research end. And it might be somewhat out of scope, but I thought I’d ask.
The speed of TTS can be a bit of an issue in some cases, so I had previously looked at caching the audio for complete output sentences. That’s fine (assuming one has fairly plentiful storage) and it doesn’t take much effort to get the TTS demo server to write out the WAV files.
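For reference, the whole-sentence caching described above could look roughly like the sketch below. It is only an illustration: `SentenceCache` and the `synthesize` callable are made-up names standing in for whatever actually produces the WAV bytes, not part of the demo server.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


class SentenceCache:
    """Cache synthesized audio on disk, keyed by a hash of the sentence text."""

    def __init__(self, cache_dir: Path, synthesize: Callable[[str], bytes]):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        # Hypothetical hook: takes text, returns WAV bytes from the TTS engine.
        self.synthesize = synthesize

    def _path_for(self, sentence: str) -> Path:
        # Collapse whitespace so trivially different inputs share one entry.
        key = " ".join(sentence.split())
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.wav"

    def get_audio(self, sentence: str) -> bytes:
        path = self._path_for(sentence)
        if path.exists():
            return path.read_bytes()  # cache hit: no synthesis needed
        audio = self.synthesize(sentence)  # cache miss: run the slow TTS step
        path.write_bytes(audio)
        return audio
```

Every distinct sentence gets its own file, which is why this only pays off when the set of sentences is small and predictable.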
With highly predictable sentences (like those from many automated processes) it can work effectively. However, it isn’t very practical when it’s hard to predict exactly which sentences will be needed. Even something as basic as a sentence containing someone’s name would often miss the cache (or require infeasible storage): even if you made a sensible guess at the most common names, unusual names are bound to come up eventually.
This got me wondering whether there might be a way to partially cache the generated speech, from the beginning up to the point where the sentences differ, and then have the model just continue from that point on. This could shave a chunk of time off sentences that share a common start but then feature a name or other hard-to-predict content.
Sentences of the kind that might benefit are, for instance, "hello [name]", since you’d only need to process the name part on demand, with the earlier part being pre-processed up front.
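The lookup side of this is the easy part. As a rough sketch (the function name and the example prefixes are just illustrative), finding which pre-processed start applies to an incoming sentence could be as simple as:

```python
from typing import Iterable, Tuple


def split_on_cached_prefix(
    sentence: str, cached_prefixes: Iterable[str]
) -> Tuple[str, str]:
    """Return (longest cached prefix, remainder); the prefix is "" if none match."""
    best = ""
    for prefix in cached_prefixes:
        if sentence.startswith(prefix) and len(prefix) > len(best):
            best = prefix
    return best, sentence[len(best):]


prefix, rest = split_on_cached_prefix(
    "Hello Zaphod", ["Hello ", "Hello there ", "Good morning "]
)
# prefix == "Hello ", rest == "Zaphod"
```

Only `rest` would then need to go through the model on demand. The hard part is making the audio for `rest` join on naturally.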
What I’m struggling with is how to get the model into the right state to continue. It can’t just start as if it were a normal new sentence, or the result will sound like badly produced concatenative speech.
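To make "the right state" a bit more concrete, here is a toy sketch of what I have in mind. It is not a real TTS model: `step` is an invented stand-in for one autoregressive decoder step, and the whole thing assumes a strictly left-to-right model whose state after the prefix does not depend on what comes later. Models that encode the whole sentence before decoding would not satisfy that assumption as-is.

```python
from typing import List, Tuple

State = int  # stand-in for a real model's hidden state


def step(state: State, token: str) -> Tuple[State, int]:
    """One toy autoregressive step: the new state depends only on past input."""
    new_state = (state * 31 + ord(token)) % 65521
    frame = new_state % 256  # stand-in for one frame of audio features
    return new_state, frame


def run(tokens: str, state: State = 0) -> Tuple[List[int], State]:
    """Generate frames for `tokens`, starting from `state`."""
    frames = []
    for tok in tokens:
        state, frame = step(state, tok)
        frames.append(frame)
    return frames, state


# Precompute and store both the frames and the final state for the common start.
prefix_frames, prefix_state = run("hello ")


def continue_from_prefix(rest: str) -> List[int]:
    """Resume from the cached state instead of recomputing the prefix."""
    rest_frames, _ = run(rest, prefix_state)
    return prefix_frames + rest_frames


# In this toy, resuming matches generating the full sentence in one go.
assert continue_from_prefix("alice") == run("hello alice")[0]
```

The point is that what gets cached is the pair (output so far, internal state), not just the audio, so the continuation is conditioned on everything that came before.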
Q. Anyone know of cases where people have a way of telling a model like this to continue from a particular point?
Q. And is there a name for that?
Unfortunately I wasn’t sure what this would be called. I did try Googling “caching neural networks” and other variants, but I didn’t spot anything that looked close to what I was trying to do.
Any ideas?