So I’m working with TTS as part of a personal accessibility software project that reads PDF files and spits out MP3s. Mozilla TTS has been really awesome for my use case, as it’s the only open-source system I’ve found that sounds nice enough to bear for hours-long audio.
A problem I’ve been having is that a lot of my input texts contain longer, complicated sentences that get clipped by max decoder steps. I looked at the source code for the decoder, and while I’m not 100% sure how it works, it doesn’t seem like the decoder announces anywhere except in the log that it ran out of steps, nor does there appear to be a good way of predicting how many steps a given sentence will take before running it. In my use case a clipped sentence is pretty undesirable, so I’m wondering if there’s any way to automatically split sentences up so that they fit in the model.
My initial thought is to predict the number of decoder steps required and split things accordingly (which is approximately what I did when using DistilBERT to improve OCR accuracy), but as I’ve said, I don’t see an immediate way to do that.
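In case it helps anyone with the same problem, here is a minimal sketch of the workaround I’m considering: since I can’t predict decoder steps directly, I use a character budget as a rough proxy and split long sentences at clause boundaries before synthesis. The `MAX_CHARS` value is an assumption on my part, not anything from the model; you’d tune it against the point where your model starts clipping.

```python
import re

# Assumption: character count is a rough proxy for decoder steps.
# Tune this against where your model starts hitting max decoder steps.
MAX_CHARS = 200

def split_long_sentence(sentence, max_chars=MAX_CHARS):
    """Split a sentence at clause boundaries (commas, semicolons, colons)
    so that each chunk stays under max_chars. A single clause longer than
    max_chars is left intact, since there is no safe place to cut it."""
    if len(sentence) <= max_chars:
        return [sentence]
    # Split after clause punctuation, keeping the delimiter with its clause.
    clauses = re.split(r"(?<=[,;:])\s+", sentence)
    chunks, current = [], ""
    for clause in clauses:
        if current and len(current) + len(clause) + 1 > max_chars:
            chunks.append(current)
            current = clause
        else:
            current = f"{current} {clause}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be fed to the model separately and the resulting audio concatenated. It’s crude, but it avoids silent clipping without needing to predict step counts.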
Thanks in advance!
edit: I forgot to include which models I use, namely:
I was using tacotron2-DCA before, but for reasons I have yet to understand, that model always ended up glitching out on longer texts (namely, repeating the ends of shorter sentences in increasingly slurred and distorted ways).
The PyPI package TTS and the models you are using are maintained by coqui-ai; you may want to ask your question at https://github.com/coqui-ai/TTS/discussions
Thanks! I’ll ask there