Multispeaker versus transfer learning

Inspired by @mrthorstenm I decided to create my own dataset as well. Starting from my own desk with a crappy headset microphone, I soon moved on to more professional methods.

In the end I hired two male voice talents; they will each provide me with 20-25 hours of Belgian Dutch voice data over the course of the coming two months. My aim is to create other voices from this data as well, hopefully with a minimum of additional data. I asked them to record in mono WAV format, 44.1 kHz and 16-bit audio.

Should I train two separate Tacotron2 models, check which one is most suitable, and use transfer learning, or is the current state of multi-speaker training good enough and easier to work with for generating future voices?

Are there any other tips or suggestions which I should think about?

Any help or input is appreciated.

2 Likes

Try all three? One Tacotron2 model for each individual, and a multi-speaker one. Then, based on which turns out best for pronunciation, use that as the starting point for transfer learning (or continued learning as needed).

1 Like

I agree with Baconator: see what works best (varying model run settings to get a feel for how they impact results is important; you’re unlikely to strike the jackpot the first time :slightly_smiling_face:)

Your amount of audio data seems sensible.

I would give some thought to the content / material you have them read. A couple of points to watch for:

  • try to get it quite diverse, covering the styles of speech it could be used for
  • include questions and statements; formal and casual text
  • balance the lengths (look at the Notebooks in TTS, as they help with checking the distribution of lengths)
  • aim for a wide vocab if you can
  • the key part in vocab is trying to provide the model with examples of all the distinct sounds and, as much as possible, all combinations of sounds (if it never heard e.g. “D” in training, it won’t know what to do at inference time, when it’s creating output)
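As a rough way to sanity-check that coverage, a quick bigram count over the transcripts can highlight sound combinations that barely appear. This is just a sketch; the transcript file name and the one-sentence-per-line layout are assumptions:

```python
from collections import Counter

# Rough coverage check: count character bigrams over the transcript text.
# Assumes one sentence per line in "transcripts.txt" (hypothetical file name).
bigrams = Counter()
with open("transcripts.txt", encoding="utf-8") as f:
    for line in f:
        text = line.strip().lower()
        bigrams.update(text[i:i + 2] for i in range(len(text) - 1))

# The rarest combinations are candidates for extra prompt sentences.
for pair, count in sorted(bigrams.items(), key=lambda kv: kv[1])[:30]:
    print(f"{pair!r}: {count}")
```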

For your readers, it’s ideal to have them aim for a middle-of-the-road spoken style, delivered as consistently as possible.

  • it needn’t be a monotone lacking all emotion, but you want a fairly consistent style for similar speech, otherwise the model can get confused
  • likewise with volume and pace: aim for consistent delivery
  • I gather that with voice pros it’s worth having a couple of their good samples ready to play at the start of each session, so they can pick up that style again easily before they start ploughing through hours of recordings
  • clarify the importance of matching the transcript (if it says “he is” they need to avoid the temptation to say “he’s”, or the model will struggle with the mismatch); if this happens by mistake and is found afterwards, you can update the transcript to be consistent, but that’s more work than avoiding it in the first place

I think you’re right to capture high-quality WAV files. However, I’ve typically seen most people use 22 or 24 kHz samples for training, so you may want to try downsampling for a run to see how it works (it should be quicker to train).
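If you go that route, a minimal resampling sketch with librosa and soundfile could look like this; the file names are placeholders:

```python
import librosa
import soundfile as sf

# Resample a 44.1 kHz recording to 22.05 kHz for a training run.
# librosa.load resamples on the fly when sr is given explicitly.
y, sr = librosa.load("speaker1_0001_44k.wav", sr=22050)
sf.write("speaker1_0001_22k.wav", y, sr, subtype="PCM_16")
```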

It will probably help too if they can record in a room without too much reverb.

Other than that, I’d have a browse over posts here to see what other points have come up for scenarios that sound like what you’re trying to do.

Hope it goes well! I’d be keen to follow how it works out.

3 Likes

Update:

Both readers have delivered approx. 8000 sentences. Unfortunately, since splitting and cleaning the audio was an extremely time-consuming task for both of them, I asked them to deliver their continuous recordings instead.

This means that only the first ~2000 audio files are nicely split and verified.
The other sentences are bundled together in bigger files, ranging from 50 to 500 sentences in a single WAV file.

If anyone has experience or tips with automatically splitting/cleaning audio files, please feel free to give input. Otherwise I’ll start with testing out the top StackOverflow answers in the coming days.

Aeneas is worth a look - provides timestamp mappings between text and audio. The timestamps can be used to cut the audio files…

Tried it briefly for splitting German M-AILABS audio books, but there is no „automagic“ - it requires a lot of fine-tuning of parameters, otherwise you will end up with cuts in the middle of words…
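For the cutting step itself, a minimal sketch using the JSON sync map that aeneas produces might look like the following; pydub and the file names are just assumptions:

```python
import json
from pydub import AudioSegment

# Cut one long recording into per-sentence WAVs using an aeneas JSON sync map.
audio = AudioSegment.from_wav("session_01.wav")
with open("session_01_syncmap.json", encoding="utf-8") as f:
    syncmap = json.load(f)

for fragment in syncmap["fragments"]:
    start_ms = int(float(fragment["begin"]) * 1000)
    end_ms = int(float(fragment["end"]) * 1000)
    audio[start_ms:end_ms].export(f"{fragment['id']}.wav", format="wav")
```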

1 Like

Yes, Aeneas is good for this kind of thing and, just as @dkreutz says, it likely will need fine-tuning.

Aeneas has good documentation. Also there’s an article here that goes over a similar use case: https://medium.com/@klintcho/creating-an-open-speech-recognition-dataset-for-almost-any-language-c532fb2bc0cf

NB the article shows producing a corpus for speech recognition (i.e. the inverse), so the final steps where it outputs the file format for that would need to be adapted, but it’s pretty straightforward. The “finetuneas” tool does speed up getting the cut timings just right, but it’ll still need work. If they’ve left reasonable pauses between sentences then you may find this is less burdensome; it’s really where the gap is narrow that the forced alignment can go off the most.

There was a branch of DeepSpeech adapted to do forced alignment too and I gather they used it for a substantial corpus import but I never got round to trying it myself (although it must’ve worked well enough as they got something like 1700 hours of audio aligned!)

I guess you mean DS Align, but it needs a working DeepSpeech model.

And you could use a big API like Google’s for recognition and align the output with your data.

Good point, I’d overlooked the lack of a model for Belgian Dutch to use with DS Align :slightly_frowning_face:

Aeneas worked absolute wonders. The only remaining issue is that I have some silence leading/trailing the actual cut-out fragments. One reader left about 2-4 seconds of silence between fragments, while the other used a consistent 0.5-1 second gap. The long silences around each fragment make the first dataset unsuitable for any practical purpose, so they must be removed.

My current approach to removing the silence isn’t very elegant. The files can’t all be processed with the same input parameters (threshold dB, chunk size, …), so some get cut off too much and others far too little. Even for smaller batches of files it is very hard to get this right.
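For context, the kind of thing I’ve been trying looks roughly like this sketch with pydub (the library choice, threshold and chunk size are placeholders that still need tuning per file):

```python
from pydub import AudioSegment
from pydub.silence import detect_leading_silence

# Trim leading and trailing silence from one fragment. The dBFS threshold
# and chunk size are exactly the parameters that are hard to pick globally.
def trim_silence(path, silence_threshold=-45.0, chunk_size=10):
    sound = AudioSegment.from_wav(path)
    start = detect_leading_silence(sound, silence_threshold, chunk_size)
    end = len(sound) - detect_leading_silence(sound.reverse(), silence_threshold, chunk_size)
    return sound[start:end]

trim_silence("fragment_0001.wav").export("fragment_0001_trimmed.wav", format="wav")
```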

To ensure the quality of the dataset, I’m afraid I’ll have to jump into audio editing software myself and remove the silences manually. Before I do that as a last resort, I’ll scout for some other common approaches used for dataset preparation.

1 Like

When I use Aeneas, I find adding task_adjust_boundary_algorithm=percent|task_adjust_boundary_percent_value=50 within the config helps. If an audio recording is longer than one hour, I split it into one-hour chunks; with longer texts Aeneas has been kind of bad for me. Then I inspect the alignment with Praat, because it almost always splits on a plosive (occlusion/release phase), and once I’ve made sure everything is aligned okay, I split into sentences using the timestamps. After that I pass the sentences through WavePad (it has an option to only remove trailing silences). It works :slight_smile:
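For reference, the whole alignment call via the aeneas Python API with those boundary settings looks roughly like this; the paths and the “nld” language code (for your Dutch data) are my assumptions:

```python
from aeneas.executetask import ExecuteTask
from aeneas.task import Task

# Align one recording against its sentence-per-line transcript and write
# the sync map as JSON, using the percent boundary adjustment from above.
config = (
    "task_language=nld"
    "|is_text_type=plain"
    "|os_task_file_format=json"
    "|task_adjust_boundary_algorithm=percent"
    "|task_adjust_boundary_percent_value=50"
)
task = Task(config_string=config)
task.audio_file_path_absolute = "/data/session_01.wav"
task.text_file_path_absolute = "/data/session_01.txt"
task.sync_map_file_path_absolute = "/data/session_01_syncmap.json"

ExecuteTask(task).execute()
task.output_sync_map_file()
```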

1 Like

Thanks to everyone for their help so far! I ended up with 15,000 fragments from one speaker and around 4,000 fragments from the other, who was unable to fulfill the agreement of delivering 15k.

Nevertheless, this will still allow me to experiment with transfer learning.

I uploaded the dataset of the first speaker and I might add the other 4k sentences if anyone shows interest.

If anyone is able to suggest some good configuration settings or share other tips regarding this dataset, please feel free to do so. I’m currently at 245k iterations on almost-default settings, which might need some tweaking.

I will add some synthesized voice samples soon.