Hello everyone, I created a custom German dataset by extracting audio files from a game called Gothic. I’ve successfully trained a model on one speaker using the repo from fatchord. Here you can hear samples from the main hero of the game.
Any tips on how I would go about training a multi-speaker model? Would I have to split every speaker into a separate folder? Is there an entry point where I could start reading about this topic?
I’m also interested in training a vocoder. Would it be possible to train a universal vocoder explicitly for this dataset?
Thanks for the repository and all the work you put into it.
Best regards.
Sounds great. Is the game State of Mind? (Just curious)
TTS should already work for multi-speaker training. The best way is to format your dataset like LibriTTS and use the formatter that is already there.
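In case it helps, here is a minimal sketch of reshuffling a flat dataset into a LibriTTS-style tree (one folder per speaker, one per chapter, paired `.wav`/`.normalized.txt` files). The helper function and the dummy chapter id "0" are my own invention; check the `libri_tts` formatter in the repo for the exact naming it expects.

```python
import os
import shutil

def to_libritts_layout(items, out_root):
    """Arrange (speaker, utt_id, wav_path, text) tuples into a
    LibriTTS-style tree:  out_root/<speaker>/<chapter>/<file>.wav
    plus a matching .normalized.txt transcript per utterance.
    A single dummy chapter "0" is used per speaker here."""
    for speaker, utt_id, wav_path, text in items:
        utt_dir = os.path.join(out_root, speaker, "0")
        os.makedirs(utt_dir, exist_ok=True)
        base = f"{speaker}_0_{utt_id:06d}"
        shutil.copy(wav_path, os.path.join(utt_dir, base + ".wav"))
        with open(os.path.join(utt_dir, base + ".normalized.txt"), "w") as f:
            f.write(text + "\n")
```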
Regarding the vocoder: we already trained and released a universal vocoder in the WaveRNN repo https://github.com/erogol/WaveRNN . It works nicely for different speakers and languages.
I’m currently in my exam period, so I haven’t had much time for experimenting.
I’ve trained a multi-speaker Tacotron model with GST to 130k steps. It learned decent attention at 130k steps. I’ll include more speakers and fully train the model once I have more time. Passing a style_wav during inference also changed the prosody of the generated speech. I’m still not sure how to work with the GST tokens, but I’ll read more about it once the exams are over.
I haven’t tried the universal vocoder yet, but my dataset has a sample rate of 22050 Hz. How would I go about fine-tuning the existing model?
Here are some more examples from the multi-speaker model, generated at around ~320k steps. I also got the warning “> Decoder stopped with ‘max_decoder_steps’”, but it fixed itself after some iterations.
…
On another note: if I want to extract the spectrograms from the TTS model, how would I go about doing so? The notebook ExtractTTSpectrogram.ipynb has the following lines:

```python
# TODO: multiple speaker
model = setup_model(num_chars, num_speakers=0, c=C)
```
Any tips on what changes are needed to extract mels from a multi speaker model?
Help on this matter would be highly appreciated.
Thx for sharing the results. It looks quite interesting, and judging by the figures, your training looks quite healthy.
For using the notebook, you need to give the model the right number of speakers, and you need to provide the list of speakers. The speakers are already exported to a JSON file; you need to load them and pass the list to the model as well. If you look at how the source code does inference, it is quite easy to implement.
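Roughly like this, assuming speakers.json is a flat {name: id} mapping (check what your training run actually exported; the `load_speakers` helper below is just an illustration):

```python
import json

def load_speakers(path):
    """Load the speaker mapping exported during training.
    Assumes a flat {speaker_name: id} JSON dict."""
    with open(path) as f:
        mapping = json.load(f)
    return mapping, len(mapping)

# In the notebook, the single-speaker line
#   model = setup_model(num_chars, num_speakers=0, c=C)
# would then become
#   speaker_mapping, num_speakers = load_speakers("speakers.json")
#   model = setup_model(num_chars, num_speakers=num_speakers, c=C)
# and each utterance gets its id via
#   speaker_id = speaker_mapping["some_speaker_name"]
```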
You might also consider tuning the audio parameters, because male and female voices generally require different values. You can try the CheckSpectrograms notebook to optimize these values.
Thanks for the help! Modifying the notebook worked quite well once I knew what to do. I’m currently retraining the model because it broke after getting to r=1. It didn’t recover from “> Decoder stopped with ‘max_decoder_steps’” and broke even on short sentences. I also cleaned up the dataset, since it had some samples where the speaker would laugh -> Text (laught).
Regarding the audio parameters, I used the notebook to evaluate some of the values, but I would say the default values are good. Then again, I’m not 100% sure what to pay attention to.
Averaged over the dataset, `mel = AP.melspectrogram(wav)` gives mel_min ~ 0.0000 and mel_max ~ 0.7900.
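For anyone curious, here is roughly how such numbers can be gathered: a small helper (my own, not from the repo) that averages the per-file min/max of the normalized mels, e.g. over `[AP.melspectrogram(AP.load_wav(p)) for p in sample_paths]`. If the average max sits well below 1.0, as here (~0.79), the normalization headroom seems fine; values clipping at 1.0 would suggest retuning.

```python
import numpy as np

def mel_range_stats(mels):
    """Aggregate min/max of normalized mel spectrograms across a
    dataset sample.  `mels` is an iterable of 2-D numpy arrays."""
    mins = [float(m.min()) for m in mels]
    maxs = [float(m.max()) for m in mels]
    return np.mean(mins), np.mean(maxs)
```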
Did you get a sense that changing memory_size to 5 made a significant difference?
From the look of the lines on the charts they seem fairly similar before and after, but maybe it helped with qualities not apparent in the charts alone.
I’m in a similar position with my training and am on the brink of attempting the 400k switch to finish off the model with BN. It’s still running right now, but based on the charts it actually seems to be doing okay at r=1 (touch wood!), which had never happened in previous runs with earlier versions of the code. Thanks for posting your progress - I’ll do the same once I’ve tried the BN switch.
I would say the only real difference I noticed was that it stopped producing “> Decoder stopped with ‘max_decoder_steps’”. But whether that is due to the memory_size change or to me cleaning up the dataset, I cannot tell.
I’ve now made the switch to BN at ~410k steps; let’s see how it performs.
I’ll try the same for r=1 after the current run and post the results for comparison.
On another note… I also tried training WaveRNN, but I get the following error:

```
  File "/home/alexander/anaconda3/lib/python3.7/site-packages/torch-1.4.0-py3.7-linux-x86_64.egg/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/media/alexander/LinuxFS/Documents/PycharmProjects/TTS/WaveRNN/dataset.py", line 74, in collate
    coarse = np.stack(coarse).astype(np.float32)
  File "<__array_function__ internals>", line 6, in stack
  File "/home/alexander/anaconda3/lib/python3.7/site-packages/numpy/core/shape_base.py", line 425, in stack
    raise ValueError('all input arrays must have the same shape')
ValueError: all input arrays must have the same shape
```
A user on GitHub had the same issue. Any tips?
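My current guess is that some wavs are shorter than the training segment the collate function slices out, so `np.stack` gets arrays of different lengths. I wrote a small check to find such files before training (the helper is my own; the hop_length default is just what I believe is used at 22050 Hz, adjust it to your config):

```python
def find_short_clips(lengths, seq_len, hop_length=275):
    """Given a {path: num_samples} dict, report files shorter than
    seq_len mel frames * hop_length audio samples, i.e. clips the
    WaveRNN collate function cannot slice a full segment from."""
    min_samples = seq_len * hop_length
    return [p for p, n in lengths.items() if n < min_samples]
```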
PPS: As in Contributing my german voice for tts, I also noticed that the model has some problems with umlauts. It pronounces o as ö and vice versa. Would graphemes be the better choice here?
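For reference, I believe these are the config.json fields that control this (key names taken from the repo’s config; values here are just an illustration). Setting use_phonemes to false would switch training to graphemes, while keeping phonemes with the language set to "de" should at least make espeak use German pronunciations:

```json
{
    "use_phonemes": true,
    "phoneme_language": "de"
}
```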
Thanks @sanjaesc. I have also tried PWGAN, based on erogol’s project. May I ask which MelGAN repo you tried? I tried this melgan repo, but the hop size could not be changed to 256, which is what corresponds to Mozilla TTS.