Results can be so much better; are we all doing it wrong?

I just read this.

I started to wonder: are we wasting our time here?

We are trying to accomplish the same thing, but most SoundCloud samples are not convincing, to say the least.

And this guy uses the same technologies with way better results.

What am I overlooking here?

  • Voice talent: a trained speaker recorded with studio equipment and professional post-processing. This costs a lot of money (think at least five digits). Compare that to an untrained speaker recording in a bedroom with a $50 USB podcast microphone.
  • Most likely the dataset was created for use in computer games. The sound examples are from the gaming context as well, so it is no surprise that it works well there. Are there any links to real-world examples, e.g. how it would sound reading a news article or a weather report?
  • The original/Tacotron comparison seems to be ground-truth inference (the model has seen the input during training), and that always sounds as good as the original. The honest way to demo is to synthesize only from held-out data; see the first sketch after this list.
  • The WaveGlow vocoder sounds good but requires a large amount of compute for training and inference. That works for the gaming context, where the voice lines are rendered offline rather than synthesized live. I personally aim for realtime inference on restricted hardware, which makes WaveGlow not a first choice for me; the second sketch after this list shows how I would measure that.
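
On the ground-truth point: the fix for your own demos is simply to carve off a held-out set before training and only publish samples synthesized from it. A minimal sketch, assuming an LJSpeech-style `metadata.csv` with one `wav_id|transcript` pair per line (the filenames and the 98/2 split are my assumptions, not anything from the linked post):

```python
import random

# Assumed LJSpeech-style metadata file: one "wav_id|transcript" per line.
with open("metadata.csv", encoding="utf-8") as f:
    lines = f.readlines()

random.seed(42)  # reproducible split
random.shuffle(lines)

split = int(0.98 * len(lines))
train, heldout = lines[:split], lines[split:]

with open("metadata_train.csv", "w", encoding="utf-8") as f:
    f.writelines(train)
with open("metadata_val.csv", "w", encoding="utf-8") as f:
    f.writelines(heldout)

# Demo samples should come from metadata_val.csv only, so the model
# has never seen those transcripts or recordings during training.
```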
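
And on the realtime point: when comparing vocoders for restricted hardware, the number to measure is the real-time factor (RTF) on the actual target device, where RTF < 1.0 means faster than realtime. A minimal sketch of how I would measure it; the vocoder here is a dummy stand-in, not WaveGlow, and the 22050 Hz rate is just the typical TTS output rate:

```python
import time

SAMPLE_RATE = 22050  # typical TTS output sample rate

def real_time_factor(synthesize, mel):
    """Time one inference call and relate it to the duration of the
    generated audio. RTF < 1.0 means faster than realtime."""
    start = time.perf_counter()
    audio = synthesize(mel)  # vocoder inference, returns raw samples
    elapsed = time.perf_counter() - start
    return elapsed / (len(audio) / SAMPLE_RATE)

if __name__ == "__main__":
    # Dummy stand-in so the sketch runs: 256 mel frames -> ~3 s of "audio".
    fake_vocoder = lambda mel: [0.0] * (len(mel) * 256)
    print(f"RTF: {real_time_factor(fake_vocoder, [0.0] * 256):.4f}")
```

Run this with the real vocoder on the target hardware, not on a desktop GPU, since that is where WaveGlow's cost shows up.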

The link to NVIDIA FastPitch is interesting; I had not heard about it before.

For comparison, you might want to look at @sanjaesc's work with the “Gothic” dataset.