Incremental TTS idea (with accompanying paper)

I stumbled on this interesting blog post, with an accompanying paper, from the people at Papercup and thought it might be of interest, @erogol.

It covers their method for outputting audio incrementally, rather than waiting for the complete input to be processed before returning any audio. The researchers use this for near real-time translation, but it would also be generally useful: if you can start playing the audio before synthesis has finished, then in theory you could time playback so that it reaches the end just as the last part of the audio is produced, which makes the total elapsed time shorter.
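To make the timing argument concrete, here is a rough latency model, not taken from the Papercup paper. It assumes the text is split into chunks that are synthesized one after another, and that a chunk can start playing once it is ready and the previous chunk has finished playing. The chunk timings are made-up numbers for illustration only.

```python
from itertools import accumulate


def batch_finish_time(synth_times, audio_durations):
    # Non-incremental: playback starts only after every chunk is synthesized.
    return sum(synth_times) + sum(audio_durations)


def incremental_finish_time(synth_times, audio_durations):
    # Incremental: chunks are synthesized sequentially, and chunk i plays as
    # soon as it is ready AND the previous chunk has finished playing.
    ready_times = accumulate(synth_times)
    playback_end = 0.0
    for ready, duration in zip(ready_times, audio_durations):
        start = max(ready, playback_end)
        playback_end = start + duration
    return playback_end


# Hypothetical numbers: 4 chunks, each taking 0.5 s to synthesize
# and producing 1.5 s of audio.
synth = [0.5, 0.5, 0.5, 0.5]
audio = [1.5, 1.5, 1.5, 1.5]
print(batch_finish_time(synth, audio))        # 8.0
print(incremental_finish_time(synth, audio))  # 6.5
```

When every chunk synthesizes faster than it plays, the incremental finish time reduces to the first chunk's synthesis time plus the total audio duration (0.5 + 6.0 = 6.5 s here), so most of the synthesis time is hidden behind playback.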

They’ve also got a nice tutorial on declipping audio, which might be useful to others. That said, of the audio I’ve looked at so far (private data, plus sources like M-AILABS), I’ve yet to find any clipped files being used for training. Details are here:
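For anyone wanting to check their own training data, a crude way to flag likely clipping is to look for runs of consecutive samples sitting at or near full scale. This is a simple heuristic of my own, not the method from the Papercup tutorial. The threshold (99% of full scale), the minimum run length (3 samples) and the helper names are all assumptions, and the WAV reader only handles mono 16-bit PCM.

```python
import array
import sys
import wave


def count_clipped_runs(samples, full_scale=32767, threshold=0.99, min_run=3):
    """Count runs of at least `min_run` consecutive samples whose magnitude is
    at or above `threshold * full_scale` (a rough clipping heuristic)."""
    limit = threshold * full_scale
    runs = 0
    current = 0
    for s in samples:
        if abs(s) >= limit:
            current += 1
        else:
            if current >= min_run:
                runs += 1
            current = 0
    if current >= min_run:  # a run may extend to the very last sample
        runs += 1
    return runs


def count_clipped_runs_in_wav(path, **kwargs):
    # Assumes a mono, 16-bit PCM WAV file (hypothetical helper).
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("this sketch only handles mono 16-bit PCM")
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h", frames)
    if sys.byteorder == "big":  # WAV data is little-endian
        samples.byteswap()
    return count_clipped_runs(samples, **kwargs)


# One run of three full-scale samples; the later run of two is too short.
print(count_clipped_runs([0, 32767, 32767, 32767, 100, -32768, -32768]))  # 1
```

Any file with a non-zero count would be worth listening to before it goes into a training set.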


Both blog posts, the one on incremental TTS and the one on audio clipping, are very interesting.
Have you found any references to source code or a repository yet?

Fun fact: one of the researcher names in the references sounded familiar to me, and indeed he was a fellow student from school/college, now a professor of computational linguistics.


I’ll take a look at this as soon as I’ve finished things up.
