Thanks for the help! Modifying the notebook worked quite well once I knew what to do. I’m currently retraining the model because it broke after reaching r=1. It didn’t recover from > Decoder stopped with ‘max_decoder_steps’ and broke even on short sentences. I also cleaned up the dataset, since it had some samples where the speaker would laugh, transcribed in the text as “(laught)”.
Regarding the audio parameters, I used the notebook to evaluate some of the values, and I would say the defaults are good. Then again, I’m not 100% sure what to pay attention to.
Averaged over the dataset, `mel = AP.melspectrogram(wav)` gives mel_min ≈ 0.0000 and mel_max ≈ 0.7900.
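For reference, that statistic can be computed with a few lines of NumPy. This is a sketch over synthetic arrays; in practice the list would hold real `AP.melspectrogram(wav)` outputs from the notebook, and `mel_stats` is just an illustrative helper, not part of the codebase:

```python
import numpy as np

def mel_stats(mels):
    """Average per-utterance min/max over a list of 2-D mel arrays."""
    mins = [float(m.min()) for m in mels]
    maxs = [float(m.max()) for m in mels]
    return float(np.mean(mins)), float(np.mean(maxs))

# stand-ins for AP.melspectrogram(wav) outputs of varying length
rng = np.random.default_rng(0)
mels = [rng.uniform(0.0, 0.79, size=(80, n)) for n in (120, 240, 95)]

lo, hi = mel_stats(mels)
print(f"avg mel_min ~ {lo:.4f}, avg mel_max ~ {hi:.4f}")
```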
Did you get a sense that the memory_size changing to 5 made a significant difference?
From the look of the charts, the curves are fairly similar before and after, but maybe it helped with qualities not apparent in the charts alone.
I’m in a similar position with my training and am on the brink of attempting the 400k switch to finish off the model on BN. It’s still running right now but based on the charts it actually seems to be doing okay on r=1 (touch wood!), which had never happened with previous runs with earlier versions of the code. Thanks for posting your progress - I’ll be doing the same once I’ve tried the BN switch.
I would say the only real difference I noticed was that it stopped producing > Decoder stopped with ‘max_decoder_steps’. But whether that’s due to the memory_size change or to me cleaning up the dataset, I cannot tell.
I now made the switch to BN at ~410k steps, let’s see how it performs.
I’ll try the same for r=1 after the current run and post the results for comparison.
On another note… I also tried training WaveRNN but get the following error:
```
  File "/home/alexander/anaconda3/lib/python3.7/site-packages/torch-1.4.0-py3.7-linux-x86_64.egg/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/media/alexander/LinuxFS/Documents/PycharmProjects/TTS/WaveRNN/dataset.py", line 74, in collate
    coarse = np.stack(coarse).astype(np.float32)
  File "<__array_function__ internals>", line 6, in stack
  File "/home/alexander/anaconda3/lib/python3.7/site-packages/numpy/core/shape_base.py", line 425, in stack
    raise ValueError('all input arrays must have the same shape')
ValueError: all input arrays must have the same shape
```
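For context, `np.stack` requires every array in the batch to have exactly the same shape, so if the sliced audio segments in the collate function end up with different lengths (often a hop_length / segment-length mismatch between the features and the audio), it crashes exactly like this. A minimal illustration of the failure mode and a padding workaround; `pad_and_stack` is a hypothetical helper for demonstration, not the project’s actual fix:

```python
import numpy as np

def pad_and_stack(samples):
    """Zero-pad 1-D arrays to the batch maximum so np.stack succeeds."""
    max_len = max(len(s) for s in samples)
    padded = [np.pad(s, (0, max_len - len(s))) for s in samples]
    return np.stack(padded).astype(np.float32)

batch = [np.ones(5), np.ones(7), np.ones(3)]  # ragged lengths -> np.stack(batch) raises ValueError
out = pad_and_stack(batch)
print(out.shape)  # (3, 7)
```

That said, if the collate function is supposed to produce fixed-size segments, padding only hides the bug; the real fix is making sure each sample is sliced to the same length upstream.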
This user at GitHub had the same issue.
Any tips?
PPS: As in Contributing my german voice for tts, I also noticed that the model has some problems with umlauts. It will pronounce o as ö and vice versa. Are graphemes the better choice here?
Thanks @sanjaesc. I have also tried PWGAN, based on erogol’s project. May I ask which MelGAN repo you tried? I tried this melgan repo, but the hop size could not be changed to 256, which is what corresponds to Mozilla TTS.
For me it was pretty straight forward, I just had to make a small adjustment for the slightly smaller memory on my 1080Ti GPU. It’s worth training longer than the 400k I did initially.
I’m currently re-training the model without phonemes, because of the aforementioned problems with umlauts. The results are way better!
I think it would make more sense to upload the model I’m currently training.
Will upload once done training.
Thanks for updating. It might be that the default char set does not include the umlaut chars. Have you edited that? Or, soon you will be able to define custom char sets in config.json on the dev branch.
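A rough sketch of what such a config.json override could look like; the exact field names should be checked against the dev branch, and the character string here is just the default English set with the German umlauts and ß appended:

```json
"characters": {
    "pad": "_",
    "eos": "~",
    "bos": "^",
    "characters": "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzÄÖÜäöüß!'(),-.:;? ",
    "punctuations": "!'(),-.:;? "
}
```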