I stumbled on this interesting blog post, with an accompanying paper, from the people at Papercup and thought it might be of interest, @erogol.
It covers their method for outputting audio incrementally, rather than waiting for the complete input to be processed before returning any audio. The researchers use it for near real-time translation, but it would be useful more generally too. If you can start playing the audio before synthesis has finished, then in theory you could time playback so that it reaches the end just as the very last part of the audio is produced, and the total elapsed time to get to the end would be shorter than synthesising everything first and only then playing it.
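To make the timing argument concrete, here is a rough back-of-the-envelope sketch. It is not Papercup's method, just arithmetic under my own assumptions: the audio comes out in chunks, each chunk has a known synthesis time and a known playback duration, and chunks are synthesised one after another. The function names and the example numbers are made up for illustration.

```python
def earliest_start(synth_times, durations):
    """Smallest delay (seconds) before playback can start without any gaps.

    synth_times[i] is how long chunk i takes to synthesise,
    durations[i] is how long chunk i takes to play.
    """
    ready = 0.0    # wall-clock time at which the current chunk is finished
    offset = 0.0   # position in the audio where the current chunk begins
    delay = 0.0
    for synth, dur in zip(synth_times, durations):
        ready += synth
        # Playback reaches this chunk at (delay + offset); it must be ready by then.
        delay = max(delay, ready - offset)
        offset += dur
    return delay


def total_elapsed(synth_times, durations, streaming=True):
    """Time from request until the last sample has been played."""
    if streaming:
        start = earliest_start(synth_times, durations)
    else:
        start = sum(synth_times)  # wait for everything before playing
    return start + sum(durations)


# Hypothetical numbers: four 1-second chunks, each taking 0.5 s to synthesise.
synth = [0.5, 0.5, 0.5, 0.5]
dur = [1.0, 1.0, 1.0, 1.0]
print(total_elapsed(synth, dur, streaming=True))
print(total_elapsed(synth, dur, streaming=False))
```

With those made-up numbers the streaming case finishes after 4.5 s versus 6.0 s when waiting for the whole thing. If synthesis is slower than real time (say 2 s per 1 s chunk), the computed delay pushes the start back so that the final chunk begins playing at the moment it is produced, which is the "time it to reach the end" case described above.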
They’ve also got a nice tutorial on declipping audio, which might be useful for others. That said, having looked at the audio I’ve got so far (both private and from places like M-AILABS), I’ve yet to find any clipped files being used for training. Details are here: https://engineering.papercup.com/posts/declipping/
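For anyone wanting to check their own training audio, a crude clipping check could look something like the sketch below. This is a simple heuristic of my own and not the approach from the Papercup tutorial: it assumes the samples are already loaded as a list of floats scaled to [-1, 1], and it flags runs of consecutive samples sitting at (or very near) full scale. The `threshold` and `min_run` values are guesses that would need tuning for real data.

```python
import math


def clipped_fraction(samples, threshold=0.999, min_run=3):
    """Fraction of samples that sit in runs of >= min_run near-full-scale values."""
    n = len(samples)
    if n == 0:
        return 0.0
    clipped = 0
    run = 0
    for x in samples:
        if abs(x) >= threshold:
            run += 1
        else:
            if run >= min_run:
                clipped += run  # count the whole flat-topped run
            run = 0
    if run >= min_run:  # a run that lasts until the end of the file
        clipped += run
    return clipped / n


# Synthetic test signals: a quiet sine, and a too-loud sine hard-limited to [-1, 1].
sr, freq = 16000, 440
sine = [math.sin(2 * math.pi * freq * t / sr) for t in range(1600)]
clean = [0.5 * s for s in sine]
clipped = [max(-1.0, min(1.0, 2.0 * s)) for s in sine]

print(clipped_fraction(clean))
print(clipped_fraction(clipped))
```

The clean signal should come back as 0.0 and the hard-limited one as a large fraction. A single sample that happens to touch full scale is ignored because of `min_run`, since only flat-topped stretches really suggest clipping.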