I am trying to fine-tune TTS models further. For that purpose, I am collecting a dataset whose audio samples are, I hope, more expressive in nature. It is tough to collect a dataset in a single voice, like the LJSpeech dataset. If I collect a good-quality dataset whose samples span several different voices, will it be difficult to train the model on it? I can make sure that each audio sample contains only one distinct voice, but across samples this may not hold.
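To make the setup concrete, a multispeaker dataset usually just adds a speaker label to each sample's metadata, so "one voice per sample, many voices across samples" is fine as long as the label is tracked. A minimal sketch (paths, transcripts, and speaker IDs are made up for illustration):

```python
from collections import defaultdict

# Hypothetical metadata rows: (audio_path, transcript, speaker_id).
# One distinct voice per clip, but the voice varies across clips.
samples = [
    ("wavs/a.wav", "Hello there.", "spk_01"),
    ("wavs/b.wav", "How are you?", "spk_02"),
    ("wavs/c.wav", "Fine, thanks.", "spk_01"),
]

# Group clips by speaker, e.g. to check per-speaker coverage
# before training.
by_speaker = defaultdict(list)
for path, text, spk in samples:
    by_speaker[spk].append((path, text))

print(sorted(by_speaker))         # ['spk_01', 'spk_02']
print(len(by_speaker["spk_01"]))  # 2
```

The training code can then feed the speaker ID alongside the text, which is exactly what multispeaker recipes do.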
Are you looking for multispeaker datasets? There are many. You could go with LibriTTS; since it consists of audiobooks, it tends to be expressive.
Yes, but would I be able to train on a multispeaker dataset? How does the model learn which voice to speak in? Does it take a speaker ID as an additional input in that case?
That is the point: it learns to speak in many voices.
If you choose single-speaker TTS, you need one voice, and a multispeaker dataset will not work. If you choose multispeaker, you train with speaker embeddings, so you get many voices. If you are looking for expressiveness, I would recommend @edresson1's fork, which implements GST with Tacotron 2, combined with LibriTTS. You can get good results.
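To illustrate what "training with speaker embeddings" means mechanically, here is a toy NumPy sketch of the common conditioning pattern: a learned embedding table is indexed by speaker ID, and the resulting vector is tiled over time and concatenated onto the text encoder's output before decoding. All sizes and the random tables are placeholders, not any particular repo's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

num_speakers, spk_dim = 4, 8   # assumed sizes for illustration
text_steps, enc_dim = 5, 16

# Speaker embedding table (random here; learned during training).
speaker_table = rng.normal(size=(num_speakers, spk_dim))

# Text-encoder output for one utterance: (time, channels).
encoder_out = rng.normal(size=(text_steps, enc_dim))

def condition_on_speaker(encoder_out, speaker_id):
    """Look up the speaker's embedding and concatenate it to every
    encoder time step, so the decoder sees speaker identity at
    each step."""
    emb = speaker_table[speaker_id]                  # (spk_dim,)
    tiled = np.tile(emb, (encoder_out.shape[0], 1))  # (time, spk_dim)
    return np.concatenate([encoder_out, tiled], axis=-1)

cond = condition_on_speaker(encoder_out, speaker_id=2)
print(cond.shape)  # (5, 24): enc_dim + spk_dim channels per step
```

At synthesis time you pick the speaker ID (and thus the voice) you want; changing the ID swaps the tiled embedding and nothing else.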