Best model + vocoder combination for realistic speech generation

Hi, I’m looking to implement realistic speech generation in a project. The goal is to generate speech that sounds as close as possible to a natural human voice. Does anyone know which model and vocoder combination works best for this?

For the German “Thorsten” dataset we are seeing good results with a TTS model that has DDC (Double Decoder Consistency) enabled, combined with a WaveGrad vocoder. That said, the result still depends heavily on the quality of your dataset and on your intended use case.
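If it helps, here is a rough sketch of how such a model and vocoder pairing could be tried out from Python. It assumes the Coqui TTS command-line tool (`tts`) is installed, which this thread does not confirm, and the two model IDs are assumptions that may not match what your installation offers (`tts --list_models` should show the real names). The helper `build_tts_command` is a hypothetical name used only for illustration; it just assembles the command and does not run it.

```python
import shlex


def build_tts_command(text, model_name, vocoder_name, out_path="output.wav"):
    """Assemble the argument list for a `tts` command-line call.

    Nothing is executed here; the list can be passed to subprocess.run()
    once the model and vocoder IDs have been verified.
    """
    return [
        "tts",
        "--text", text,
        "--model_name", model_name,
        "--vocoder_name", vocoder_name,
        "--out_path", out_path,
    ]


# Assumed IDs for a DDC-enabled Thorsten model and a WaveGrad vocoder.
# Verify them against `tts --list_models` before relying on them.
cmd = build_tts_command(
    "Guten Tag, wie geht es Ihnen?",
    model_name="tts_models/de/thorsten/tacotron2-DDC",
    vocoder_name="vocoder_models/de/thorsten/wavegrad",
)

# Print a shell-safe version of the command for copy and paste.
print(shlex.join(cmd))
```

Keeping the model and vocoder as separate parameters makes it easy to swap in another vocoder and compare the audio by ear, which is probably the most reliable test for your own dataset.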

