How do I create a preprocess script for a custom dataset?

I have a dataset of around 1,600 samples, and I notice that in order to fine-tune a TTS model I need a preprocessor. The information on this is pretty scarce, so I was wondering if anyone could give me some pointers.

Also, when do I run it? Thanks!

Hi @Rio

Have you looked in the wiki?

These preprocessors are simply a bit of code to get your data (in whatever directory and file format you have) loaded into the training programme.

You don’t mention anything about how your data is currently set up but if you can, you might find it easier to organise your data into the format of an existing preprocessor - the one I use is LJSpeech.

It’s worth looking over the code in dataset/preprocess.py; it’s nothing too complicated and it should be easy to see what it’s doing. You don’t run it separately: it’s called when your data is loaded, and the preprocessor you pick in your config is the one that gets used.
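To give you a feel for it, here’s a minimal sketch of what an LJSpeech-style preprocessor looks like. I’m paraphrasing from memory, so treat the exact names and column indices as illustrative rather than verbatim from the repo:

```python
import os

def ljspeech(root_path, meta_file):
    """Load an LJSpeech-style dataset: a pipe-separated metadata
    file plus a wavs/ folder containing the audio files."""
    items = []
    with open(os.path.join(root_path, meta_file), 'r', encoding='utf-8') as f:
        for line in f:
            cols = line.strip().split('|')
            # cols[0] is the filename stem, cols[1] the transcription
            wav_file = os.path.join(root_path, 'wavs', cols[0] + '.wav')
            items.append([cols[1], wav_file])
    return items
```

The training code calls whichever function matches the dataset name in your config, so writing a custom preprocessor is really just writing a function like this for your own layout.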

A quick overview of the LJSpeech layout: there’s a /wavs folder for all your wav files (as you’d probably guessed!) and two CSV files (training and validation) with rows made up of the corresponding filename stem (without the .wav extension) and the text corresponding to the audio. It’s actually a pipe-separated file rather than a true CSV. LJSpeech has both the raw text and the normalised text; if yours is already normalised then these can be the same.
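In case it’s easier to see concretely, the layout looks roughly like this (the filenames here are made up):

```
my_dataset/
├── metadata_train.csv
├── metadata_val.csv
└── wavs/
    ├── utt_0001.wav
    ├── utt_0002.wav
    └── ...
```

And the rows in the metadata files are pipe-separated, stem first, then raw text, then normalised text:

```
utt_0001|Dr. Smith arrived at 10 a.m.|Doctor Smith arrived at ten a m
utt_0002|The quick brown fox.|The quick brown fox.
```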

If you get stuck or can’t visualise it from this description then perhaps it’s worth downloading LJSpeech so you can see the way it’s organised. As an added bonus, then you’ll be able to do some initial training with LJSpeech data - I’d strongly recommend that over jumping into something you’ve never done before with brand new data, which is a recipe for confusion. Using LJSpeech for an initial run means you’ll get the hang of the basics, flush out human error on your part, and be reasonably sure that you’re not running into issues due to your data, because you’re working with a known good dataset.

I’d suggest having a decent look over the repo too. Generally proof of effort improves the chances of others helping you.

Maybe take a peek at the dev branch too, because the README there has been smartened up, so you should find links to the info you need quite easily.

Hope that’s a start at least :slightly_smiling_face:

Thank you for the help!

I was doing something similar with the LJSpeech layout, although you have definitely sped up my understanding of it.

I didn’t notice the dev branch for whatever reason during my scouring of the repo, although rest assured I looked around. Thanks for the help and stay safe!

Glad to help!
BTW, since I wrote the answer above, the README in master has been updated from dev, so you should be good just looking in master now.

Hi. I am new to this area, so I was searching the wiki FAQ, but I didn’t find a solution for preprocessing a custom dataset. I want to use MozillaTTS with my own Spanish dataset. Could you please help me? Thanks a lot.

I’m confused. Did you read my answer above too?

And you looked at the rest of the wiki, right? It’s not that much to skim.

Anyway, to save you time, here’s the page you need for understanding the processing: https://github.com/mozilla/TTS/wiki/Dataset

The preprocessing being discussed here is loading the data, so it’s not generally going to depend on the language; it’ll depend on the format of your data.

One possiblity is that you’re thinking more about the cleaning functions for your text, and if so then you’d need to look at the code here (especially cleaners.py). I haven’t worked with non-English transcriptions but I’m guessing it would be similar(ish) with Spanish but there are bound to be some language differences. If it’s this you’re after advice on, I know there are others here who’ve worked in other languages so maybe they can help. Probably still worth looking over the code I link to though, so you’ll have an idea of what it’s doing with English and can then think about what would be different with Spanish.