How to generate actual speech

djl0 · June 24, 2020, 12:18am

Hi, I apologize, this feels like a very basic question, but I’ve been poking around github and here and haven’t really gotten a clear answer. How would I actually use this tool to create an mp3 from a piece of text? I don’t particularly care which model it uses. And if there’s any docker images or files, even better.

Thanks a lot, this is such a cool project

nmstoker · June 24, 2020, 1:53am

Not sure if you looked at the wiki but this section will help:

The demo server lets you produce audio. If you really need MP3, you’d need to convert the output yourself, but that should be fairly straightforward with something like sox or pydub (just Google that bit if you’re unsure).
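For the conversion step, here is a minimal sketch using pydub. It assumes the audio you saved from the server is a WAV file, that pydub is installed, and that ffmpeg is available on your PATH (pydub hands MP3 encoding to it). The helper names `mp3_path_for` and `wav_to_mp3` are made up for illustration.

```python
from pathlib import Path


def mp3_path_for(wav_path):
    """Return the input path with its suffix swapped to .mp3."""
    return Path(wav_path).with_suffix(".mp3")


def wav_to_mp3(wav_path, bitrate="192k"):
    """Convert a WAV file to an MP3 next to it and return the new path."""
    # Imported lazily so the path helper above works without pydub installed.
    from pydub import AudioSegment

    out_path = mp3_path_for(wav_path)
    audio = AudioSegment.from_wav(str(wav_path))
    # MP3 export is delegated to ffmpeg, which must be on the PATH.
    audio.export(str(out_path), format="mp3", bitrate=bitrate)
    return out_path
```

Calling `wav_to_mp3("speech.wav")` should then leave a `speech.mp3` beside the original file.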

Sadam1195 · June 24, 2020, 2:02am

@nmstoker I am a bit confused. So in order to produce audio, should I also have WaveRNN or another voice synthesizer (e.g. MelGAN) trained on my personal dataset?

erogol · June 24, 2020, 9:52am

Yes, you need to train them. Or you can skip them and use only the TTS model, with a small sacrifice in quality.

Sadam1195 · June 24, 2020, 10:55am

Okay, I have a dataset in LJSpeech format in Italian, and I want to skip training a WaveRNN. I have finished training the TTS model.

Now how do I generate speech from text? From what I have seen so far, all the notebook tutorials require a vocoder (like WaveRNN or another voice synthesizer) to generate speech. How do I generate speech using only the TTS model, without a vocoder, as you said?

nmstoker · June 24, 2020, 4:24pm

Hi @Sadam1195 - If you don’t use one of the vocoders, it will fall back to using Griffin-Lim (aka “GL”). It’s not quite as good from a quality perspective, but it is definitely capable of producing reasonable audio from a good model (so it’s a good place to start).

In your case where you have a trained model, I’d suggest you look at the server folder, figure out how things fit together, set up the config so it uses your model and then run server.py. That’ll bring up the server locally and you can then send an example sentence to it via your browser (that’s easiest; it’s also possible to call it via requests or some other tool, but that’s beyond the scope here).
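If you do want to call the server from a script rather than the browser, a rough sketch might look like the following. It assumes the demo server exposes a GET endpoint at `/api/tts` that takes a `text` query parameter and returns WAV bytes; check the routes in server.py for your checkout, as this may differ. The port 5002 comes from the config, and the function names are hypothetical.

```python
from urllib.parse import urlencode


def build_tts_url(text, host="localhost", port=5002):
    """Build the request URL for one sentence."""
    # ASSUMPTION: the endpoint path and the "text" parameter name.
    return f"http://{host}:{port}/api/tts?" + urlencode({"text": text})


def synthesize_to_file(text, out_path="speech.wav", host="localhost", port=5002):
    """Send text to a running server and save the returned audio."""
    import requests  # third-party: pip install requests

    response = requests.get(build_tts_url(text, host, port), timeout=120)
    response.raise_for_status()  # fail loudly on HTTP errors
    with open(out_path, "wb") as f:
        f.write(response.content)
    return out_path
```

With the server running locally, `synthesize_to_file("Hello there.")` would write the response to `speech.wav`.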

The line in the config file that mentions it falling back to GL is here:

https://github.com/mozilla/TTS/blob/3366328126b329380dcf9d81b064976e9eb96e17/server/conf.json#L6


{
    "tts_path":"/media/erogol/data_ssd/Models/libri_tts/5049/",  // tts model root folder
    "tts_file":"best_model.pth.tar",     // tts checkpoint file
    "tts_config":"config.json",     // tts config.json file
    "tts_speakers": null,           // json file listing speaker ids. null if no speaker embedding.
    "wavernn_lib_path": null,   // Rootpath to wavernn project folder to be imported. If this is null, model uses GL for speech synthesis.
    "wavernn_path":null,  // wavernn model root path
    "wavernn_file":null, // wavernn checkpoint file name
    "wavernn_config": null, // wavernn config file
    "is_wavernn_batched":true, 
    "port": 5002,
    "use_cuda": true,
    "debug": true
}

So you just need to leave wavernn_lib_path as null and update the other settings to point at your model (i.e. tts_path).
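As a sketch of what that looks like in practice, the snippet below builds a config with the same keys as the file quoted above, leaves every WaveRNN entry as null so the server falls back to Griffin-Lim, and writes it out as plain JSON. The model path in the usage note is a placeholder and the function names are made up for illustration.

```python
import json


def griffin_lim_conf(tts_path, tts_file="best_model.pth.tar",
                     tts_config="config.json", port=5002, use_cuda=True):
    """Return a server config dict that uses no vocoder (Griffin-Lim only)."""
    return {
        "tts_path": tts_path,        # root folder of your trained TTS model
        "tts_file": tts_file,        # checkpoint file inside tts_path
        "tts_config": tts_config,    # the model's config.json inside tts_path
        "tts_speakers": None,        # no speaker embedding
        "wavernn_lib_path": None,    # null here means Griffin-Lim is used
        "wavernn_path": None,
        "wavernn_file": None,
        "wavernn_config": None,
        "is_wavernn_batched": True,
        "port": port,
        "use_cuda": use_cuda,
        "debug": True,
    }


def write_conf(conf, path="conf.json"):
    """Write the config as plain JSON (None becomes null)."""
    with open(path, "w") as f:
        json.dump(conf, f, indent=4)
```

For example, `write_conf(griffin_lim_conf("/path/to/my_italian_model/"))` would produce a conf.json pointing at your own checkpoint; set `use_cuda=False` if you have no GPU.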

djl0 · June 25, 2020, 7:19pm

@nmstoker Thank you very much, that was a big help!