How to train with 18 hours of annotated audio

Hey all,

Love the Mozilla TTS library. Previously I’ve used the 185k iterations model, but now I want to train my own model. Following @erogol’s suggestion, I had 18 hours of audio recorded by a voice actor as WAV files. The clips have been annotated, and each one is one or two sentences long.

My question is, how do I actually use the TTS library to train on my corpus? Can someone point me to the right approach? I’m guessing it’s not as simple as pointing to a corpus manifest, but I could be wrong. I couldn’t find anything about how to set up my workspace for training, so I wanted to see if anyone here had any tips for training a new model with Mozilla TTS.


I’m still working on this myself, but the easiest way (IMHO) is to reformat the data using the conventions of a well-known dataset. I’m using LJSpeech. Just put your data in that format and point TTS at it. Also see the wiki post on this topic. The diagnostic notebooks it references are very useful for finding problems in your dataset. Since yours was carefully curated, you probably don’t have many problems, but I’d still recommend checking.
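For anyone wondering what "put your data in that format" looks like in practice, here is a minimal sketch. It assumes the LJSpeech layout as distributed: a `wavs/` folder plus a pipe-delimited `metadata.csv` with one `id|transcription|normalized transcription` row per clip. The function name, the `clip_00001` naming scheme, and the list of (wav path, transcript) pairs are made up for illustration; how you load your own annotations depends on how they are stored, and the exact config entries for pointing TTS at the folder should be checked against the repo.

```python
import shutil
from pathlib import Path


def write_ljspeech_layout(clips, out_dir):
    """Copy clips into out_dir/wavs and write out_dir/metadata.csv.

    `clips` is an iterable of (wav_path, transcript) pairs (hypothetical
    input shape; adapt it to however your annotations are stored).
    Returns the number of clips written.
    """
    out_dir = Path(out_dir)
    wav_dir = out_dir / "wavs"
    wav_dir.mkdir(parents=True, exist_ok=True)

    rows = []
    for index, (wav_path, transcript) in enumerate(clips, start=1):
        clip_id = f"clip_{index:05d}"  # hypothetical naming scheme
        # Collapse newlines and repeated spaces so each clip stays on one row.
        text = " ".join(transcript.split())
        if "|" in text:
            raise ValueError(f"transcript for {wav_path} contains '|'")
        shutil.copyfile(wav_path, wav_dir / f"{clip_id}.wav")
        # Same text in both columns; LJSpeech itself spells out numbers
        # and abbreviations in the third column.
        rows.append((clip_id, text, text))

    with open(out_dir / "metadata.csv", "w", encoding="utf-8", newline="\n") as f:
        for row in rows:
            f.write("|".join(row) + "\n")
    return len(rows)
```

Usage would be something like `write_ljspeech_layout(my_pairs, "my_voice_dataset")`, where `my_pairs` is your own list. If your transcripts contain digits or abbreviations, you may want a real normalization step for the third column rather than copying the raw text.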

Good to know other people are working on this too! I’ll report back with what I tried. Time to reformat / repackage the data into the right holes / names / manifests. I’m also curious how long training will take. I think I read that around 185K iterations of training has produced some of the best TTS models for this system. I wonder how long each iteration is going to take. Good luck @aolney!

I’m doing a run on a 1080ti right now, and it’s done about 60K iterations in 15 hours. The speech is somewhat intelligible but not great yet :slight_smile:
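As a rough back-of-the-envelope check on those numbers: 60K iterations in 15 hours works out to about 0.9 seconds per iteration, so 185K iterations would take roughly 46 hours on the same card. The sketch below just does that arithmetic; it assumes the per-iteration time stays constant, which may not hold for a different dataset, batch size, or GPU.

```python
def estimate_training_hours(iters_done, hours_elapsed, target_iters):
    """Extrapolate total training time from progress so far.

    Assumes a constant time per iteration (a simplification).
    Returns (seconds_per_iteration, total_hours_for_target).
    """
    seconds_per_iter = hours_elapsed * 3600 / iters_done
    total_hours = target_iters * seconds_per_iter / 3600
    return seconds_per_iter, total_hours


# Figures from the run above: 60K iterations in 15 hours, target of 185K.
sec_per_iter, total_hours = estimate_training_hours(60_000, 15, 185_000)
print(f"{sec_per_iter:.2f} s/iteration, roughly {total_hours:.0f} hours for 185K iterations")
```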

Whew. That’s reassuring. I’m just glad it doesn’t take a day for 3 iterations hehehe. Good news.

I agree with Andrew that following LJSpeech is a good way to go.

If you have time, it might be worth actually trying to train with the LJSpeech dataset first. That way you’ll have ironed out basic issues and will know what a reasonable time / outcome is on your hardware. If you jump straight into a new dataset with a process you’re not familiar with, it compounds the challenge of figuring out where problems are! The Colab here (link from Readme) is a good start for that:
Plus you’ll then have an example of LJSpeech format right there in front of you, so you know what you’re aiming for.
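Once you have a folder that is supposed to match that example, a quick sanity check can catch the obvious mismatches before you start a long run. This is only a sketch using the Python standard library, not one of the project's diagnostic notebooks. It assumes the `metadata.csv` plus `wavs/` layout of LJSpeech and plain PCM WAV files (which is all the `wave` module reads); the function name is made up.

```python
import wave
from pathlib import Path


def check_ljspeech_dir(root):
    """Return (problems, total_hours) for an LJSpeech-style folder.

    Flags malformed metadata rows and rows whose WAV file is missing,
    and sums the duration of the clips that were found.
    """
    root = Path(root)
    problems = []
    total_seconds = 0.0
    lines = (root / "metadata.csv").read_text(encoding="utf-8").splitlines()
    for line_no, line in enumerate(lines, start=1):
        fields = line.split("|")
        if len(fields) != 3 or not fields[1].strip():
            problems.append(f"line {line_no}: expected id|text|normalized text")
            continue
        wav_path = root / "wavs" / f"{fields[0]}.wav"
        if not wav_path.is_file():
            problems.append(f"line {line_no}: missing {wav_path.name}")
            continue
        with wave.open(str(wav_path), "rb") as wav_file:
            total_seconds += wav_file.getnframes() / wav_file.getframerate()
    return problems, total_seconds / 3600
```

Running `problems, hours = check_ljspeech_dir("my_voice_dataset")` and printing both gives a list of rows to fix and a total duration you can compare against the amount of audio you expect to have.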

FWIW I’ve decided I have to manually clean my data, so I’ve built a web-based tool to facilitate that process. Here’s the blog post with links to GitHub and walk-through video:

Awesome. I will definitely test this out in a couple of weeks. I actually have two datasets being created right now: one with a voice actor, and one with clips of speech harvested from the internet. I think this tool will be super useful for the speech harvested from the internet.

THANKS @aolney!!!

Hey, any updates on this? Were you able to get good results?