I am trying to fine-tune TTS models further. For that purpose, I am collecting a dataset whose audio samples are, I hope, more expressive in nature. It is tough to collect a dataset in a single voice, like the LJSpeech dataset. If I collect a good-quality dataset whose samples span several different voices, will it be difficult to train the model on it? I can make sure that each audio sample contains only one distinct voice, but across samples this may not hold.
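To make the setup concrete, a multispeaker dataset usually just adds a speaker label to each sample's metadata, so "one voice per sample, many voices across samples" is fine as long as the label is tracked. A minimal sketch (paths, transcripts, and speaker IDs are made up for illustration):

```python
from collections import defaultdict

# Hypothetical metadata rows: (audio_path, transcript, speaker_id).
# One distinct voice per clip, but the voice varies across clips.
samples = [
    ("wavs/a.wav", "Hello there.", "spk_01"),
    ("wavs/b.wav", "How are you?", "spk_02"),
    ("wavs/c.wav", "Fine, thanks.", "spk_01"),
]

# Group clips by speaker, e.g. to check per-speaker coverage
# before training.
by_speaker = defaultdict(list)
for path, text, spk in samples:
    by_speaker[spk].append((path, text))

print(sorted(by_speaker))         # ['spk_01', 'spk_02']
print(len(by_speaker["spk_01"]))  # 2
```

The training code can then feed the speaker ID alongside the text, which is exactly what multispeaker recipes do.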
Are you looking for multispeaker datasets? There are many. You could go with LibriTTS; since it consists of audiobooks, it tends to be expressive.
Yes, but would I be able to train on a multispeaker dataset? How does the model learn which voice to speak in? Does it take a speaker ID as an additional input in that case?
That is the point: it learns to speak in many voices.
If you choose single-speaker TTS, you need one voice, and a multispeaker dataset will not work. If you choose multispeaker, you train with speaker embeddings, so you get many voices. If you are looking for expressiveness, I would recommend @edresson1's fork, which implements GST with Tacotron 2, combined with LibriTTS. You can get good results.
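To illustrate what "training with speaker embeddings" means mechanically, here is a toy NumPy sketch of the common conditioning pattern: a learned embedding table is indexed by speaker ID, and the resulting vector is tiled over time and concatenated onto the text encoder's output before decoding. All sizes and the random tables are placeholders, not any particular repo's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

num_speakers, spk_dim = 4, 8   # assumed sizes for illustration
text_steps, enc_dim = 5, 16

# Speaker embedding table (random here; learned during training).
speaker_table = rng.normal(size=(num_speakers, spk_dim))

# Text-encoder output for one utterance: (time, channels).
encoder_out = rng.normal(size=(text_steps, enc_dim))

def condition_on_speaker(encoder_out, speaker_id):
    """Look up the speaker's embedding and concatenate it to every
    encoder time step, so the decoder sees speaker identity at
    each step."""
    emb = speaker_table[speaker_id]                  # (spk_dim,)
    tiled = np.tile(emb, (encoder_out.shape[0], 1))  # (time, spk_dim)
    return np.concatenate([encoder_out, tiled], axis=-1)

cond = condition_on_speaker(encoder_out, speaker_id=2)
print(cond.shape)  # (5, 24): enc_dim + spk_dim channels per step
```

At synthesis time you pick the speaker ID (and thus the voice) you want; changing the ID swaps the tiled embedding and nothing else.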