You guessed right, this will be a rant.
I tried for dozens of hours to get a model to converge.
It seems so close, yet it is so far.
I used Aeneas to cut an audiobook recording into clips aligned with the text.
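For anyone trying the same thing, this is roughly what that alignment step looks like with the aeneas Python API; the paths and the task configuration below are just placeholders standing in for my setup:

```python
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

# Force-align a plain-text transcript against the audiobook recording
# and write the resulting sync map (clip boundaries) to a JSON file.
config_string = u"task_language=deu|is_text_type=plain|os_task_file_format=json"
task = Task(config_string=config_string)
task.audio_file_path_absolute = u"/path/to/audiobook.mp3"
task.text_file_path_absolute = u"/path/to/transcript.txt"
task.sync_map_file_path_absolute = u"/path/to/syncmap.json"

ExecuteTask(task).execute()
task.output_sync_map_file()
```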
I took samples and they looked fine.
I used MTTS with it, and it hit max decoder steps every time.
I figured I had just been looking at the good files and that most were cut the wrong way.
So I started going through them and realized that 4,000 files are a tremendous amount of work to review.
So I used the tools available here and removed some files.
Still, there were plenty of useless files left, with the occasional good one among them.
I often read about clipped audio, yet I normalized everything with ffmpeg's dynaudnorm filter, so clipping should not be an issue.
But I noticed clipping in the datasets of other models that converged better.
What is a good maximum dB value anyway?
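For what it's worth, clipping in the digital domain means samples hitting full scale (0 dBFS), and a common rule of thumb is to keep peaks a few dB below that. Here is a minimal sketch for checking the peak level of a clip, assuming 16-bit PCM WAV files; the path is made up:

```python
import wave
import numpy as np

def peak_dbfs(path):
    # Peak level of a 16-bit PCM WAV file in dBFS (0.0 = full scale).
    with wave.open(path, "rb") as wav:
        frames = wav.readframes(wav.getnframes())
    samples = np.frombuffer(frames, dtype=np.int16).astype(np.float64)
    peak = np.max(np.abs(samples)) / 32768.0
    return 20 * np.log10(peak) if peak > 0 else float("-inf")

# Clips reporting 0.0 dBFS are the ones worth inspecting for clipping.
print(peak_dbfs("clips/sample_0001.wav"))
```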
Now I am building on the German dataset; all in all, my dataset is now 3 hours long.
The samples are at most 11 seconds long.
I am using a batch size of 32.
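Something like the following sketch can enforce that length limit, assuming an LJSpeech-style metadata.csv (pipe-separated, clip id in the first column); the paths and the 11-second cap are placeholders:

```python
import os
import wave

MAX_SECONDS = 11.0  # drop anything longer than this

def clip_seconds(path):
    # Duration of a PCM WAV file in seconds.
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

with open("metadata.csv", encoding="utf-8") as src, \
     open("metadata_filtered.csv", "w", encoding="utf-8") as dst:
    for line in src:
        clip_id = line.split("|", 1)[0]
        wav_path = os.path.join("wavs", clip_id + ".wav")
        if clip_seconds(wav_path) <= MAX_SECONDS:
            dst.write(line)
```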
Ironically, the best results came from running speech recognition on the voice samples rather than from forced alignment.
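One possible way to run such a pass (not necessarily the exact tool I used) is the speech_recognition package with Google's free web API; the file name and language code here are assumptions:

```python
import speech_recognition as sr

recognizer = sr.Recognizer()

# Transcribe a single German clip.
with sr.AudioFile("clips/sample_0001.wav") as source:
    audio = recognizer.record(source)

try:
    print(recognizer.recognize_google(audio, language="de-DE"))
except sr.UnknownValueError:
    print("could not transcribe this clip")
```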
Creating good datasets.
We have come so far.
All we have to do is pair a sentence with a wave file.
But it is still a tremendous amount of work to deliver that.
We can be happy to be alive now and not back when you had to dissect audio files manually to create TTS models.
In my opinion, there is still a long way to go before this is mostly effortless.
Yet I am happy to be part of it.