Swedish TTS process question

Hi

I am planning to try building a Swedish TTS from scratch with a custom voice.
If you have any input on the process, please let me know:

  1. I am building a large Swedish dataset with transcriptions. The set, however, has several different speakers.
  2. I plan to train a model from scratch using this set.
  3. Once trained, I plan to resume training on the smaller custom-voice dataset to fine-tune the model to that particular voice.

Do you think this is feasible?

Make sure to read the wiki entry on datasets

Key factors to a good dataset:

  • clean recording - get a good microphone
  • consistent conditions - always record in the same room, same position, same distance from the microphone, same mic settings, etc.
  • reduce background noise as much as possible
  • speak at a constant pace and tonality; don't record when you feel a cold coming on, are tired, etc.
  • speak exactly as the transcript says
  • …
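Some of these checks can be partly automated before training. A minimal sketch, assuming the common LJSpeech-style `metadata.csv` layout (`file_id|transcript` per line) - adjust the parsing to whatever format your dataset actually uses:

```python
def check_metadata(lines, allowed_extra="åäöÅÄÖ"):
    """Flag entries whose transcripts are empty or contain unexpected
    characters (a rough proxy for transcription errors)."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        parts = line.rstrip("\n").split("|")
        if len(parts) < 2 or not parts[1].strip():
            problems.append((lineno, "missing transcript"))
            continue
        for ch in parts[1]:
            # accept ASCII plus the Swedish letters; anything else is suspicious
            if ch.isascii() or ch in allowed_extra:
                continue
            problems.append((lineno, f"unexpected character {ch!r}"))
    return problems

sample = [
    "clip_0001|Hej, hur mår du?",
    "clip_0002|",                  # empty transcript, gets flagged
    "clip_0003|Det går bra, tack.",
]
print(check_metadata(sample))
```

Catching empty or mis-encoded transcripts early is much cheaper than discovering them after a long training run.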

Here is the repo of the german dataset that @mrthorstenm contributed (and where I participated) - there is some more documentation: https://github.com/thorstenMueller/deep-learning-german-tts
Feel free to ask more questions…

Hi, I intend to use an open-source dataset with phrases. I know about the audio requirements, but I'm a bit unsure about the localization part.
As Swedish has letters that English doesn't, I suppose there are some crucial settings that need to be changed in order to train it properly.

In config.json, check the “characters” section and add special characters if they are missing. In case you do phoneme-based training, check that the list of phonemes is complete.

You simply have to try whether character or phoneme training gives better results.
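One way to decide what to add: scan your transcripts and compare them against the config. A minimal sketch - the `characters` layout below mirrors the usual Mozilla TTS `config.json` fields, but the exact keys are an assumption, so verify against your config version:

```python
# "characters" section roughly as in a Mozilla TTS config.json
# (keys and symbol list are an assumption -- check your own config)
characters = {
    "pad": "_",
    "eos": "~",
    "characters": (
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz"
        "åäöÅÄÖ!'(),-.:;? "
    ),
}

def missing_chars(transcripts, char_config):
    """Return characters that occur in the data but not in the config."""
    known = set(char_config["characters"]) | {char_config["pad"], char_config["eos"]}
    seen = set("".join(transcripts))
    return sorted(seen - known)

print(missing_chars(["Hej världen!", "Smörgåsbord à la carte"], characters))
# 'à' is not in the list above, so it gets reported
```

Any character the model has never seen in its symbol table will either be dropped or crash preprocessing, so this check is worth running on the full transcript set before training.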

Thanks! But other than that, the training procedure is the same?

When setting “use_phonemes” to true, don't forget to set “phoneme_language” to Swedish.
Also check “text_cleaner” and consider creating your own in case the default one does not fit your use case.
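A custom cleaner for Swedish can start out quite small. A hedged sketch of what such a cleaner might do - the function name and the tiny number table are illustrative, not part of the Mozilla TTS API:

```python
import re

# tiny illustrative number table -- a real cleaner would cover far more
_SV_NUMBERS = {"1": "ett", "2": "två", "3": "tre"}

def swedish_cleaner(text):
    """Lowercase, expand a few digits to Swedish words, and collapse
    whitespace; keeps å, ä, ö intact."""
    text = text.lower()
    text = re.sub(r"\d", lambda m: " " + _SV_NUMBERS.get(m.group(), m.group()) + " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(swedish_cleaner("Jag har 2 hundar"))  # "jag har två hundar"
```

The important part is that whatever the cleaner outputs must only contain symbols that exist in your “characters” (or phoneme) list, since the cleaner runs before text-to-symbol conversion.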


Here are examples of Mozilla TTS trained on Swedish.
Single speaker: https://soundcloud.com/user-839318192/sets/mozillatts-tacotron2-multiband-melgan-swedish
Multi speaker: https://soundcloud.com/user-839318192/sets/multi-speaker-tacotron2-swedish


The single speaker sounds quite metallic, but it is impressive how the multi-speaker set turned out. Well done.

Is the single speaker dataset open source?

Thanks! A small effort compared with the huge effort Mozilla made implementing this.

In my runs, the multi-band MelGAN vocoder has issues with breathing. I continued training the vocoder for 1M steps, but the metallic sound didn't disappear. The pre-trained model available for download has the same issues on my data. WaveGAN sounds better overall for my data, but the MB-MelGAN vocoder sounds more realistic if you disregard the metallic sound when the model makes breathing pauses.

All of these datasets are unfortunately private. Datasets in Swedish are very difficult to find.

Yep, true. I only speak Danish, and the WaveGAN is a bit easier for me personally to understand 🙂 But the MB-MelGAN sounds Swedisher.

Have you tried any of the NST datasets?


Yes, https://soundcloud.com/user-839318192/sets/mozillatts-tacotrongst-swedish

It’s from about a year ago, using Tacotron 1 with Griffin-Lim.


Thank you for the link! I’m working on a Swedish speech-to-text and a text-to-speech model for the Rhasspy voice assistant. If I can get these datasets downloaded, they will be very helpful 🙂

How did you produce these datasets? Interviews and transcription tools?

Also, did you use a pretrained network and then transfer-learn or fine-tune? Or did you just use a single-speaker set?