Hi, I apologize, this feels like a very basic question, but I've been poking around GitHub and here and haven't really gotten a clear answer. How would I actually use this tool to create an MP3 from a piece of text? I don't particularly care which model it uses. And if there's a Docker image or Dockerfile, even better.
Not sure if you looked at the wiki but this section will help:
The demo server lets you produce audio. If you really need MP3 you'd need to convert that yourself, but it should be fairly straightforward with something like sox or pydub (just Google that bit if you're unsure).
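To illustrate the conversion step, here's a minimal sketch that shells out to ffmpeg (which is also what pydub uses under the hood). This is just an assumption about your setup: the filenames are placeholders, and it requires ffmpeg to be on your PATH.

```python
import subprocess

def wav_to_mp3_cmd(wav_path: str, mp3_path: str) -> list:
    """Build an ffmpeg command that converts a WAV file to MP3."""
    # -y overwrites the output file if it already exists
    return ["ffmpeg", "-y", "-i", wav_path, mp3_path]

# Example (uncomment once you have a WAV from the server):
# subprocess.run(wav_to_mp3_cmd("tts_output.wav", "tts_output.mp3"), check=True)
```

With pydub the equivalent is a one-liner (`AudioSegment.from_wav("tts_output.wav").export("tts_output.mp3", format="mp3")`), but it still needs ffmpeg installed.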
Okay, I have a dataset in LJSpeech format in Italian, and I want to skip training a WaveRNN. I have finished training the TTS model.
Now, how do I generate speech from text? From what I have seen so far, every notebook tutorial requires a vocoder (like WaveRNN or another neural synthesizer) to generate speech. How do I generate speech using only the TTS model, without a vocoder, like you said?
Hi @Sadam1195 - If you don't use one of the vocoders then it will fall back to using Griffin-Lim (aka "GL"). It's not quite as good from a quality perspective, but it is definitely capable of producing reasonable audio from a good model (so it's a good place to start).
In your case where you have a trained model, I’d suggest you look at the server folder, figure out how things fit together, set up the config so it uses your model and then run server.py. That’ll bring up the server locally and you can then send an example sentence to it via your browser (that’s easiest; it’s also possible to call it via requests or some other tool, but that’s beyond the scope here).
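If you'd rather script it than use the browser, here's a rough sketch of hitting the local demo server with just the standard library. The route and port here are assumptions (check server.py and your config for the actual values); the Italian sample sentence is just an example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

def tts_url(text: str, base_url: str = "http://localhost:5002") -> str:
    """Build the request URL for the demo server's TTS endpoint.

    /api/tts and port 5002 are assumptions -- confirm them in server.py.
    """
    return f"{base_url}/api/tts?{urlencode({'text': text})}"

# Example (uncomment once the server is running):
# wav_bytes = urlopen(tts_url("Ciao, come stai?")).read()
# with open("tts_output.wav", "wb") as f:
#     f.write(wav_bytes)
```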
The line in the config file that mentions it falling back to GL is here:
So you just need to leave that null and update the other settings to point at your model (i.e. tts_path).
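As a rough sketch, the relevant part of the server config would look something like this. The exact key names may differ in your version of the repo, so treat these as illustrative and check them against the actual config file:

```json
{
    "tts_path": "/path/to/your/trained/model/folder/",
    "vocoder_path": null
}
```

With the vocoder entry left null, the server should fall back to Griffin-Lim for waveform generation as described above.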