How to train with 18 hours of annotated audio

joshua.eisenberg · June 10, 2019, 4:23pm

Hey all,

Love the Mozilla TTS library. Previously I’ve used the 185k iterations model, but now I want to train my own. Following @erogol’s suggestion in https://github.com/mozilla/TTS/issues/165, I had a voice actor record 18 hours of audio as WAV files. The clips have been annotated, and each one is one or two sentences long.

My question is, how do I actually use the TTS library to train on my corpus? Can someone point me to the right approach? I’m guessing it’s not as simple as pointing to a corpus manifest, but I could be wrong. I couldn’t find anything about how to set up my workspace for training, so I wanted to see if anyone here had any tips for training a new model with Mozilla TTS.

Thanks!

aolney · June 11, 2019, 12:36am

I’m still working on this myself, but the easiest way (IMHO) is to reformat the data using the conventions of a well-known dataset. I’m using LJSpeech. Just put your data in that format and point TTS at it. Also see the wiki post on this topic. The diagnostic notebooks it references are very useful for finding problems in your dataset. Since yours was carefully curated, you probably don’t have many problems, but I’d still recommend checking.
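
For anyone unsure what "that format" looks like on disk: LJSpeech is a `wavs/` folder plus a pipe-delimited `metadata.csv` with one line per clip (`id|transcript|normalized transcript`). Below is a minimal sketch of repackaging a corpus that way. The function name and the `(wav_path, transcript)` input shape are placeholders made up for illustration, not part of TTS, and the sketch does no text normalization: it just repeats the raw transcript in the third column.

```python
import shutil
from pathlib import Path


def write_ljspeech_layout(clips, out_dir):
    """Copy clips into out_dir/wavs and write out_dir/metadata.csv.

    clips: iterable of (wav_path, transcript) pairs (hypothetical input shape).
    Returns the list of clip ids that were written.
    """
    out_dir = Path(out_dir)
    wav_dir = out_dir / "wavs"
    wav_dir.mkdir(parents=True, exist_ok=True)

    ids = []
    with open(out_dir / "metadata.csv", "w", encoding="utf-8", newline="") as f:
        for wav_path, transcript in clips:
            wav_path = Path(wav_path)
            clip_id = wav_path.stem  # file name without ".wav" becomes the id
            text = " ".join(transcript.split())  # collapse newlines and repeated spaces
            if "|" in text:
                # "|" is the column separator, so it cannot appear in a transcript
                raise ValueError(f"transcript for {clip_id} contains '|'")
            shutil.copy(wav_path, wav_dir / f"{clip_id}.wav")
            # Assumption: no separate normalized text, so column 3 repeats column 2
            f.write(f"{clip_id}|{text}|{text}\n")
            ids.append(clip_id)
    return ids
```

Calling `write_ljspeech_layout(pairs, "my_corpus")` should leave a folder you can compare side by side with a real LJSpeech download before pointing TTS at it.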

joshua.eisenberg · June 11, 2019, 2:47pm

Good to know other people are working on this too! I’ll report back with what I tried. Time to reformat / repackage the data into the right holes / names / manifests. I’m also curious how long training will take. I think I read that around 185K iterations of training has led to some of the best TTS models for this system. I wonder how long each iteration is going to take. Good luck @aolney!

aolney · June 11, 2019, 2:55pm

I’m doing a run on a 1080 Ti right now, and it’s done about 60K iterations in 15 hours. The speech is somewhat intelligible but not great yet.
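
As a back-of-the-envelope check on those numbers: 60K iterations in 15 hours is about 4,000 iterations per hour, or roughly 0.9 s per iteration, so 185K iterations would be somewhere around 46 hours. That assumes the rate stays constant, which it won't exactly, and it will differ with batch size, clip lengths and hardware. The helper name is made up for illustration.

```python
def hours_to_reach(target_iters, done_iters, elapsed_hours):
    """Linear extrapolation: assumes the iteration rate stays constant."""
    rate = done_iters / elapsed_hours  # iterations per hour
    return target_iters / rate


rate = 60_000 / 15  # figures from the run described above
total = hours_to_reach(185_000, 60_000, 15)
print(f"{rate:.0f} it/h, {3600 / rate:.2f} s/it, about {total:.0f} h to reach 185K")
```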

joshua.eisenberg · June 11, 2019, 3:04pm

Whew. That’s reassuring. I’m just glad it doesn’t take a day for 3 iterations hehehe. Good news.

nmstoker · June 11, 2019, 3:40pm

I agree with Andrew that following LJSpeech is a good way to go.

If you have time, it might be worth actually training with the LJSpeech dataset first. That way you’ll have ironed out the basic issues and you’ll know what a reasonable time / outcome is on your hardware. If you jump straight into a new dataset with a process you’re not familiar with, it compounds the challenge of figuring out where the problems are! The Colab here (linked from the Readme) is a good start for that: https://gist.github.com/erogol/97516ad65b44dbddb8cd694953187c5b
Plus you’ll then have an example of LJSpeech format right there in front of you, so you know what you’re aiming for.
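
Once you have that reference copy, a quick structural comparison can catch the boring mistakes before a long training run. This is only a sketch under the assumption that the target is the usual LJSpeech layout (a `metadata.csv` with three pipe-separated columns and a `wavs/` folder with one `<id>.wav` per line). The function name is hypothetical, and it checks structure only, not audio quality or alignment.

```python
from pathlib import Path


def check_ljspeech_layout(root):
    """Return a list of human-readable problems found under root."""
    root = Path(root)
    meta = root / "metadata.csv"
    if not meta.is_file():
        return [f"missing {meta}"]

    problems = []
    seen = set()
    with open(meta, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            fields = line.rstrip("\r\n").split("|")
            if len(fields) != 3:
                problems.append(f"line {lineno}: expected 3 fields, got {len(fields)}")
                continue
            clip_id, text, _normalized = fields
            if clip_id in seen:
                problems.append(f"line {lineno}: duplicate id {clip_id}")
            seen.add(clip_id)
            if not text.strip():
                problems.append(f"line {lineno}: empty transcript for {clip_id}")
            if not (root / "wavs" / f"{clip_id}.wav").is_file():
                problems.append(f"line {lineno}: no wavs/{clip_id}.wav")
    return problems
```

An empty list from `check_ljspeech_layout("my_corpus")` only means the folder is shaped like LJSpeech; the diagnostic notebooks are still the better tool for finding bad clips.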

aolney · June 18, 2019, 7:20pm

FWIW I’ve decided I have to manually clean my data, so I’ve built a web-based tool to facilitate that process. Here’s the blog post with links to GitHub and walk-through video: https://olney.ai/category/2019/06/18/manualalignment.html

joshua.eisenberg · June 18, 2019, 8:01pm

Awesome. I will definitely test this out in a couple of weeks. I’m actually having two datasets created right now: one with a voice actor, and one with clips of speech harvested from the internet. I think this tool will be super useful for the speech harvested from the internet.

THANKS @aolney!!!
cheers

guitarplayersachin · January 29, 2021, 4:15am

Hey, any updates on this? Were you able to get good results?