Swedish TTS process question

Hi

I am planning to try building a Swedish TTS from scratch with a custom voice.
If you have any input on the process, please let me know:

  1. I am building a large Swedish dataset with transcriptions. The set, however, has several different speakers.
  2. I plan to train a model from scratch using this set.
  3. Once trained, I plan to resume training on the smaller custom-voice dataset to fine-tune the model to that particular voice.

Do you think this is feasible?

Make sure to read the wiki entry on datasets

Key factors to a good dataset:

  • clean recording - get a good microphone
  • consistent conditions - always record in the same room, same position, same distance from the microphone, same mic settings, etc.
  • reduce background noise as much as possible
  • speak at a constant pace and tonality; don't record when you feel a cold coming on, are tired, etc.
  • speak exactly as the transcript says
  • …
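Some of these checks can be partly automated before training. A minimal sketch, assuming the common LJSpeech-style `metadata.csv` layout (`file_id|transcript` per line) - adjust the parsing to whatever format your dataset actually uses:

```python
def check_metadata(lines, allowed_extra="åäöÅÄÖ"):
    """Flag entries whose transcripts are empty or contain unexpected
    characters (a rough proxy for transcription errors)."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        parts = line.rstrip("\n").split("|")
        if len(parts) < 2 or not parts[1].strip():
            problems.append((lineno, "missing transcript"))
            continue
        for ch in parts[1]:
            # accept ASCII plus the Swedish letters; anything else is suspicious
            if ch.isascii() or ch in allowed_extra:
                continue
            problems.append((lineno, f"unexpected character {ch!r}"))
    return problems

sample = [
    "clip_0001|Hej, hur mår du?",
    "clip_0002|",                  # empty transcript, gets flagged
    "clip_0003|Det går bra, tack.",
]
print(check_metadata(sample))
```

Catching empty or mis-encoded transcripts early is much cheaper than discovering them after a long training run.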

Here is the repo of the german dataset that @mrthorstenm contributed (and where I participated) - there is some more documentation: https://github.com/thorstenMueller/deep-learning-german-tts
Feel free to ask more questions…

Hi, I intend to use an open-source dataset with phrases. I know about the audio requirements, but I'm a bit unsure about the localization part.
As Swedish has letters that English doesn't, I suppose there are some crucial settings that need to be changed in order to train it properly.

In config.json, check the “characters” section and add special characters if they are missing. In case you do phoneme-based training, check that the list of phonemes is complete.

You simply have to try whether character or phoneme training gives better results.
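One way to decide what to add: scan your transcripts and compare them against the config. A minimal sketch - the `characters` layout below mirrors the usual Mozilla TTS `config.json` fields, but the exact keys are an assumption, so verify against your config version:

```python
# "characters" section roughly as in a Mozilla TTS config.json
# (keys and symbol list are an assumption -- check your own config)
characters = {
    "pad": "_",
    "eos": "~",
    "characters": (
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz"
        "åäöÅÄÖ!'(),-.:;? "
    ),
}

def missing_chars(transcripts, char_config):
    """Return characters that occur in the data but not in the config."""
    known = set(char_config["characters"]) | {char_config["pad"], char_config["eos"]}
    seen = set("".join(transcripts))
    return sorted(seen - known)

print(missing_chars(["Hej världen!", "Smörgåsbord à la carte"], characters))
# 'à' is not in the list above, so it gets reported
```

Any character the model has never seen in its symbol table will either be dropped or crash preprocessing, so this check is worth running on the full transcript set before training.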

Thanks! But other than that, the training procedure is the same?

When setting “use_phonemes” to true, don't forget to set “phoneme_language” to Swedish.
Also check “text_cleaner” and consider creating your own in case the default one does not fit your use case.
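A custom cleaner for Swedish can start out quite small. A hedged sketch of what such a cleaner might do - the function name and the tiny number table are illustrative, not part of the Mozilla TTS API:

```python
import re

# tiny illustrative number table -- a real cleaner would cover far more
_SV_NUMBERS = {"1": "ett", "2": "två", "3": "tre"}

def swedish_cleaner(text):
    """Lowercase, expand a few digits to Swedish words, and collapse
    whitespace; keeps å, ä, ö intact."""
    text = text.lower()
    text = re.sub(r"\d", lambda m: " " + _SV_NUMBERS.get(m.group(), m.group()) + " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(swedish_cleaner("Jag har 2 hundar"))  # "jag har två hundar"
```

The important part is that whatever the cleaner outputs must only contain symbols that exist in your “characters” (or phoneme) list, since the cleaner runs before text-to-symbol conversion.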


Here are examples of Mozilla TTS trained on Swedish.
Single speaker: https://soundcloud.com/user-839318192/sets/mozillatts-tacotron2-multiband-melgan-swedish
Multi speaker: https://soundcloud.com/user-839318192/sets/multi-speaker-tacotron2-swedish


The single speaker sounds quite metallic, but it is impressive how the multi-speaker set turned out. Well done.

Is the single speaker dataset open source?

Thanks! A small effort compared with the huge effort Mozilla made implementing this.

In my runs, the multi-band MelGAN vocoder has issues with breathing. I continued training the vocoder for 1M steps, but the metallic sound didn't disappear. The pre-trained model available for download has the same issues on my data. WaveGAN sounds better overall for my data, but the MB-MelGAN vocoder sounds more realistic if you disregard the metallic sound when the model makes breathing pauses.

All of these datasets are unfortunately private. Datasets in Swedish are very difficult to find.

Yep, true. I only speak Danish, and the WaveGAN is a bit easier for me personally to understand 🙂 But the MB-MelGAN sounds Swedisher.

Have you tried any of the NST datasets?


Yes, https://soundcloud.com/user-839318192/sets/mozillatts-tacotrongst-swedish

It’s from about a year ago, using Tacotron 1 with Griffin-Lim.


Thank you for the link! I’m working on a Swedish speech-to-text and a text-to-speech model for the Rhasspy voice assistant. If I can get these datasets downloaded, they will be very helpful 🙂

How did you produce these datasets? Interviews and transcription tools?

Also, did you use a pretrained network and then transfer-learn or fine-tune? Or did you just use a single-speaker set?