first: Sorry if this forum is not meant for people to ask general questions about the speech synthesis/TTS field. This is basically the only active forum on TTS I found where I can ask some (basic) questions.
So yeah, basically I’m relatively new to the TTS field and I have a question about the data. Most datasets seem to consist of either a single speaker, or multiple speakers with clear annotations of which speaker said which sentence.
Why is this? Will a model underperform drastically if I just train on a dataset with multiple speakers without using a one-hot speaker encoding vector, as Amazon did?
I already played around with Mozilla DeepSpeech (STT) before, and there the model was able to recover the text from the MFCCs no matter which speaker said it. For TTS, though, the other way around seems to be harder to generalize across multiple speakers. Why is this? And how could I fix this problem when I don’t know which reader said which sentence?
Thank you in advance!
PS: I would also appreciate it if people post papers or relevant links on this problem, or just general introductions to TTS, since I don’t seem to succeed in finding any in-depth documentation about TTS to teach myself (all I can find are papers on specific models, but none about the whole process and the important details to keep in mind).
Hi @Plato - I think this is a perfectly reasonable thing to ask about. Whilst it’s general, this kind of discussion seems useful for people using the TTS repo.
First to address your PS:
If you want some relevant papers on TTS, then there’s a great repo listing several key ones here:
Those are academic papers - several of them include links to samples and often GitHub repos with the authors’ own code. In a few notable cases the models covered have been implemented in the TTS repo.
If you want something more beginner friendly then googling is best - often there are university lectures on YouTube for topics like this that can be a good start. I don’t have any to hand now but I’ll see what I can dig out.
Here’s something that gives a bit of an introduction over time, but if you’re motivated mainly by getting up to speed then I’m guessing you may not need to go into too much depth on the older methods (i.e. pre-deep-learning approaches) - though it’s handy to know they exist, and resources related to them can often still be useful.
The Wikipedia entry is also pretty comprehensive.
Now to your main question. I can give a partial answer, but you may benefit from comments from those more involved in multi-speaker usage.
Firstly, this is something of an empirical subject - with so many factors, often just doing it gives the best insights.
My suspicion is that if you have multiple voices but they are not identified to the model, it will struggle enormously to create a decent voice - if you look at the waveforms for different speakers they can be dramatically different. The model will simultaneously be attempting to fit samples from one speaker and another, without knowing which is right. As a human who understands speech, you may be overlooking just how different speakers sound, because you’ve learned to understand them all.
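To make the conditioning idea from your question concrete: the usual fix is to tell the model who is speaking, e.g. by tiling a one-hot speaker vector across time and concatenating it to the encoder output. This is just a toy numpy sketch - the function name and dimensions are made up, not from the TTS repo:

```python
import numpy as np

def add_speaker_condition(encoder_out, speaker_id, num_speakers):
    """Tile a one-hot speaker vector over the time axis and concatenate
    it to every encoder timestep (illustrative helper, names made up)."""
    one_hot = np.zeros(num_speakers)
    one_hot[speaker_id] = 1.0
    timesteps = encoder_out.shape[0]
    tiled = np.tile(one_hot, (timesteps, 1))             # (T, num_speakers)
    return np.concatenate([encoder_out, tiled], axis=1)  # (T, D + num_speakers)

# Toy usage: 5 timesteps of 8-dim encoder features, 3 known speakers
enc = np.random.randn(5, 8)
conditioned = add_speaker_condition(enc, speaker_id=1, num_speakers=3)
print(conditioned.shape)  # (5, 11)
```

The decoder then knows which speaker each utterance belongs to, so it no longer has to average over conflicting voices. In practice a learned speaker embedding usually replaces the raw one-hot vector.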
This comes up to an extent even when training with a single voice. If the samples from the speaker are too varied (i.e. differing in style too much, and arbitrarily) then the voice quality struggles.
One example where this happens is with the EK1 dataset trained model (search the forum for that if interested). In that dataset, derived from LibriVox novels, the speaker often reads characters in their corresponding accents. I suspect this impairs the quality somewhat, and it’s only the substantial amount of training data in that set (32+ hours) that overcomes it.
Not sure if this is practical, but depending on how many speakers you think you have, you may be able to use speaker diarisation techniques to identify them and label your dataset from the clusters. One of the notebooks in the repo has code to make a start on that: https://github.com/mozilla/TTS/blob/master/notebooks/PlotUmapLibriTTS.ipynb (although I’m having trouble getting it to display currently)
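As a rough illustration of the clustering step (not the embedding extraction itself, which would normally come from a speaker encoder like the one that notebook uses): given one embedding vector per utterance, even a minimal k-means can group utterances into speaker labels. Everything below is a toy sketch in plain numpy with made-up data:

```python
import numpy as np

def kmeans_labels(embeddings, k, iters=20):
    """Minimal k-means over utterance embeddings. Farthest-point
    initialisation keeps this toy example deterministic."""
    centroids = [embeddings[0]]
    for _ in range(k - 1):
        dists = np.min(
            [np.linalg.norm(embeddings - c, axis=1) for c in centroids], axis=0)
        centroids.append(embeddings[dists.argmax()])
    centroids = np.array(centroids)
    for _ in range(iters):
        # assign each utterance to its nearest centroid
        d = np.linalg.norm(embeddings[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        # move each centroid to the mean of its assigned utterances
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = embeddings[labels == j].mean(axis=0)
    return labels

# Two well-separated fake "speakers" in a 4-dim embedding space
rng = np.random.default_rng(1)
speaker_a = rng.normal(0.0, 0.1, size=(10, 4))
speaker_b = rng.normal(5.0, 0.1, size=(10, 4))
labels = kmeans_labels(np.vstack([speaker_a, speaker_b]), k=2)
print(labels)
```

With real data you would feed in learned speaker embeddings (d-vectors or similar) rather than synthetic points, and the resulting cluster IDs become the speaker labels for your training set.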
@nmstoker thank you for the extensive reply! It really helped me to understand it a bit further and the extra resources are great!
When looking around a bit more into vocoders I ended up with the same question. Must a vocoder be trained on one speaker, or can it be trained on multiple speakers without using a speaker embedding / one-hot vector?
For example: can I just feed it (MFCC, audio) pairs and expect a good result, or should I feed it (MFCC, audio, speaker) triplets? From most papers I’ve read, my feeling is that (MFCC, audio) pairs should be enough. But if that’s the case, what does the output voice sound like? Can it differ depending on the MFCCs, or will it be the same voice in the wav every time?
So a vocoder can cover a single voice or multiple ones (to create a universal vocoder). I’ve only trained them on a single voice, but there is a universal vocoder available via the latest version of the TTS repo.
For the single voice case you just give it the audio and it creates the spectrograms itself. I believe it’s the same for multi-voice, but it’s worth looking at the code/config files in case I’ve missed something for multiple-voice usage.
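To illustrate what "creates the spectrograms itself" means in terms of data: the features are derived from the same audio, so each training example ends up as a (features, waveform) pair. Here’s a deliberately simplified magnitude-spectrogram extraction in numpy - a real setup would use mel filterbanks and the repo’s audio config, and all parameter names here are made up:

```python
import numpy as np

def magnitude_spectrogram(audio, n_fft=256, hop=64):
    """Frame the waveform, window each frame, and take the magnitude FFT.
    A stand-in for the mel-spectrogram extraction a vocoder pipeline does."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(audio) - n_fft + 1, hop):
        frames.append(np.abs(np.fft.rfft(audio[start:start + n_fft] * window)))
    return np.array(frames)  # shape: (num_frames, n_fft // 2 + 1)

# One second of a 220 Hz tone at an 8 kHz sample rate stands in for speech
sr = 8000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 220 * t)

feats = magnitude_spectrogram(audio)
training_pair = (feats, audio)  # what the vocoder trains on
print(feats.shape)  # (122, 129)
```

Because the features are computed from the audio itself, nothing extra is needed for the single-voice case; for a universal vocoder you would feed it audio from many speakers in the same way.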