Increase the Volume of the Voice during Inference

btomtom5 · June 6, 2019, 1:15am

Does anyone know how to increase the volume of the synthesized voices?
I’ve tried the following things to no avail. Parts of the audio inevitably get ‘clipped’ and become noisy. My guess is that some parts of the audio are quiet and other parts are loud, which makes it difficult to amplify the whole clip without creating noise in other parts.

Thanks in advance. Any guidance would be much appreciated!
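The guess above can be checked numerically. Here is a minimal sketch, assuming the synthesized waveform is available as a sequence of floats in the range [-1.0, 1.0]. The function name `level_stats` is purely illustrative and is not part of TTS. It reports the peak level, the RMS level, and the headroom, i.e. how many dB of gain can be applied before the loudest sample clips.

```python
import math


def level_stats(samples):
    """Return (peak_dbfs, rms_dbfs, headroom_db) for float samples.

    Assumes samples are floats in [-1.0, 1.0], where 1.0 is full scale.
    """
    if not samples:
        raise ValueError("samples must be non-empty")

    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))

    def to_db(x):
        # Silence has no finite dB value, so report -inf for it.
        return 20.0 * math.log10(x) if x > 0 else float("-inf")

    peak_db = to_db(peak)
    # Headroom is the gain (in dB) that brings the peak exactly to full scale.
    return peak_db, to_db(rms), -peak_db


# Hypothetical clip: mostly quiet samples with one loud peak.
peak_db, rms_db, headroom_db = level_stats([0.02, -0.03, 0.9, -0.01, 0.02])
print(f"peak {peak_db:.1f} dBFS, rms {rms_db:.1f} dBFS, headroom {headroom_db:.1f} dB")
```

A large gap between the peak and RMS levels means a few loud samples use up the headroom, so any plain gain beyond `headroom_db` will clip them.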

nmstoker · June 6, 2019, 10:50pm

Hi @btomtom5 - welcome to the forum.

I haven’t experienced the issue you mention, but in case you haven’t seen it, there’s some info on the wiki about finding parameters that work well for a particular dataset. I wonder whether it might help with the noise issue and, in turn, help the model produce results in a more normal range.

This section specifically:

CheckSpectrograms is to measure the noise level of the clips and find good audio processing parameters. Noise level might be observed by checking spectrograms. If spectrograms look cluttered, especially in silent parts, this dataset might not be a good candidate for a TTS project. If your voice clips are too noisy in the background, it makes things harder for your model to learn the alignment and the final result might be different than the voice you are given. If the spectrograms look good, then the next step is to find a good set of audio processing parameters, defined in config.json. In the notebook, you can compare different sets of parameters and see the resynthesis results in relation to the given ground truth. Find the parameters that give the best possible synthesis performance.

nmstoker · June 7, 2019, 3:45am

Also, did you miss something out about what you’ve tried?

btomtom5 · June 21, 2019, 2:00am

Thanks @nmstoker! I’ve looked at the CheckSpectrograms notebook before, and I haven’t had any problems with training TTS. I was trying to amplify the final output waveform but found it hard to do so without causing the loudest parts of the wave to clip. Unfortunately, I still haven’t found a solution for this. Perhaps better normalization before training would help. If I find anything, I’ll report back.
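For the post-processing route, one common approach is to scale the waveform so its peak sits just below full scale and, if that is still too quiet, to apply extra gain through a soft limiter instead of a hard clip. The sketch below is not taken from the TTS codebase. It assumes float samples in [-1.0, 1.0]; the names `peak_normalize` and `soft_limit` are hypothetical, and the tanh curve is just one illustrative limiter that trades hard clipping for milder distortion of the loud parts.

```python
import math


def peak_normalize(samples, target_peak=0.95):
    """Scale samples so the largest absolute value equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        # Pure silence: nothing to scale.
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]


def soft_limit(samples, gain, ceiling=1.0):
    """Apply gain, then squash with tanh so output never exceeds the ceiling.

    Quiet samples are amplified almost linearly; loud samples are
    compressed smoothly toward the ceiling instead of being cut off.
    """
    return [ceiling * math.tanh(s * gain / ceiling) for s in samples]


# Hypothetical waveform with quiet speech and one loud peak.
wave = [0.0, 0.05, -0.1, 0.8, -0.05]
normalized = peak_normalize(wave)        # loudest sample becomes 0.95
louder = soft_limit(wave, gain=4.0)      # every sample stays within [-1, 1]
```

Peak normalization alone adds no distortion but can only use the available headroom. The soft limiter makes quiet passages noticeably louder at the cost of reshaping the peaks, so it is worth listening to the result at a few gain values.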