Audio preprocessing

CrazyJoeDevola · September 6, 2020, 9:37am

Hi

Does anyone have experience with how much effect consistently normalized audio has on the final output?
What level are you normalizing to? -6 dB?

dkreutz · September 6, 2020, 5:44pm

You can configure TTS/Tacotron to preprocess and normalize the audio, so there is no need to do that yourself. What does help is a consistent average loudness. I preprocessed our dataset with RNNoise and pyloudnorm (the latter with an average loudness of -25 dB, a bit lower than usual to avoid clipping artefacts).
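
To illustrate the idea of bringing every clip to the same average level, here is a minimal sketch in plain Python. It is an assumption-laden stand-in, not the pyloudnorm workflow itself: it measures simple RMS level in dBFS rather than the perceptual loudness pyloudnorm computes, and the function names `rms_dbfs` and `normalize_rms` are made up for this example. The -25 default mirrors the target mentioned above.

```python
import math


def rms_dbfs(samples):
    """Return the RMS level of float samples in [-1.0, 1.0], in dB re full scale."""
    if not samples:
        raise ValueError("empty signal")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")  # pure silence has no finite level
    return 10.0 * math.log10(mean_square)


def normalize_rms(samples, target_dbfs=-25.0):
    """Scale samples so their RMS level equals target_dbfs (hypothetical helper).

    Raises ValueError rather than clipping if the gain would push a peak
    beyond full scale.
    """
    current = rms_dbfs(samples)
    if math.isinf(current):
        return list(samples)  # nothing to scale in a silent clip
    gain = 10.0 ** ((target_dbfs - current) / 20.0)
    scaled = [s * gain for s in samples]
    if max(abs(s) for s in scaled) > 1.0:
        raise ValueError("target level would clip this clip; choose a lower target")
    return scaled


# Example: a 440 Hz tone at half amplitude, one second at 22050 Hz.
sample_rate = 22050
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
normalized = normalize_rms(tone, target_dbfs=-25.0)
```

Applying the same target to every clip is what gives the dataset a consistent average level; the clipping check is the reason for choosing a fairly low target.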

CrazyJoeDevola · September 7, 2020, 6:47am

Thanks. That probably makes it easier. I use desktop software (Audacity) to process all files right now.

What about other noise and background rumble in the audio? To what extent will that affect the final TTS?

dkreutz · September 7, 2020, 7:02am

The audio signal should be as clean as possible. The TTS model cannot distinguish between the speech signal and unwanted background noise, so it will otherwise learn things like room echo as part of the signal, which leads to an echoing and/or robotic-sounding voice.

As already mentioned, RNNoise removes a lot of noise very effectively. In addition, you can apply lowpass (e.g. 8000 Hz) and highpass (e.g. 50 Hz) filtering and set the fmin/fmax parameters in the config file.
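
As a rough sketch of what the highpass and lowpass steps do, the snippet below implements first-order (one-pole) filters in plain Python. This is only an illustration under assumptions: the function names are invented here, and a one-pole filter rolls off gently (about 6 dB per octave), so a real preprocessing pipeline would likely use a steeper filter from an audio tool or DSP library. The 50 Hz and 8000 Hz cutoffs are the example values given above.

```python
import math


def _check_cutoff(cutoff_hz, sample_rate):
    # The cutoff must lie strictly between 0 Hz and the Nyquist frequency.
    if not 0.0 < cutoff_hz < sample_rate / 2.0:
        raise ValueError("cutoff must be between 0 and sample_rate / 2")


def highpass(samples, cutoff_hz, sample_rate):
    """First-order RC highpass: attenuates content below cutoff_hz (e.g. rumble)."""
    _check_cutoff(cutoff_hz, sample_rate)
    if not samples:
        return []
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out = [samples[0]]
    for i in range(1, len(samples)):
        out.append(alpha * (out[-1] + samples[i] - samples[i - 1]))
    return out


def lowpass(samples, cutoff_hz, sample_rate):
    """First-order RC lowpass: attenuates content above cutoff_hz (e.g. hiss)."""
    _check_cutoff(cutoff_hz, sample_rate)
    if not samples:
        return []
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = dt / (rc + dt)
    out = [alpha * samples[0]]
    for i in range(1, len(samples)):
        out.append(out[-1] + alpha * (samples[i] - out[-1]))
    return out


# Example: a constant (0 Hz) offset is removed by the highpass
# and passed through unchanged by the lowpass.
sample_rate = 22050
dc_offset = [1.0] * sample_rate
after_highpass = highpass(dc_offset, 50.0, sample_rate)
after_lowpass = lowpass(dc_offset, 8000.0, sample_rate)
```

Chaining `lowpass(highpass(x, 50.0, sr), 8000.0, sr)` keeps roughly the 50 Hz to 8000 Hz band, which is the range the fmin/fmax settings would then be matched to.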

CrazyJoeDevola · September 9, 2020, 10:44am

Thanks a lot. Where can I read more about this? I suppose this is RNNoise? https://github.com/xiph/rnnoise

Are the other variables entered into the config?

Do they need to be set for the whole training run, or are they only used when synthesizing?
Or can I change them later, once the model has been trained for a while?

CrazyJoeDevola · September 9, 2020, 10:46am

Also, sometimes I get a long “noise trail” when rendering audio, either with the TTS test audios or together with the PWGAN decoder.

I want to use the best possible vocoder (it doesn’t matter if it takes longer). Which would you recommend?

dkreutz · September 9, 2020, 1:00pm

Yes, that is the correct one.

fmin/fmax are parameters of the TTS-internal audio processor and must be set at training time; they have no effect at inference time.
Since you want to train a vocoder model, make sure to use the same audio processing settings as for the training of the Taco/TTS model. If I remember correctly, the dev branch already uses the same config for Taco and vocoder training.
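
One way to guard against mismatched settings is to compare the audio sections of the two configs before starting vocoder training. The sketch below is hypothetical: the key names (`sample_rate`, `num_mels`, `mel_fmin`, `mel_fmax`) and the helper `audio_config_mismatches` are assumptions for illustration, so check them against the actual config files of the branch you are using.

```python
# Keys assumed (for this example only) to describe the audio processing.
CHECKED_KEYS = ("sample_rate", "num_mels", "mel_fmin", "mel_fmax")


def audio_config_mismatches(tts_audio, vocoder_audio, keys=CHECKED_KEYS):
    """Return {key: (tts_value, vocoder_value)} for every checked key that differs.

    A key missing from one config is reported with None on that side.
    """
    mismatches = {}
    for key in keys:
        tts_value = tts_audio.get(key)
        vocoder_value = vocoder_audio.get(key)
        if tts_value != vocoder_value:
            mismatches[key] = (tts_value, vocoder_value)
    return mismatches


# Example with made-up values: only mel_fmax differs.
tts_audio = {"sample_rate": 22050, "num_mels": 80, "mel_fmin": 50.0, "mel_fmax": 8000.0}
vocoder_audio = {"sample_rate": 22050, "num_mels": 80, "mel_fmin": 50.0, "mel_fmax": 7600.0}
problems = audio_config_mismatches(tts_audio, vocoder_audio)
```

An empty result means the checked settings agree; anything else should be fixed before training, because the vocoder has to be trained on spectrograms computed the same way as the ones the TTS model produces.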

I don’t have much practical experience with vocoders yet, but to my understanding the best results can be achieved with ParallelWaveGAN and/or Multiband-MelGAN.