Punctuation and speed of reading

Has anyone looked at the topics of punctuation and/or reading speed in TTS?

For punctuation, a couple of months back I experimented with adjusting the espeak output so that it kept the commas that normally get stripped from the input text. The model trained on that data was responsive to commas, but its speech quality was degraded. If there’s interest, I can write up the process, and I may try again (as MelGAN has moved quality forward dramatically).

I’m also interested in the speed of the output. It’s no doubt largely determined by my dataset, but the model definitely seems to read a touch faster than expected. I might add a postprocessing step so I can adjust the speed outside the model. I’m also wondering whether a GST approach might help there instead (not something I’ve looked at closely).

Any suggestions on either of these points?

I’d say the best way is to do postprocessing. Baking it into the model leaves you open to future inconveniences.
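A minimal sketch of such a speed-postprocessing step, assuming the synthesized audio is available as a NumPy array (this is a naive resampling approach for illustration, not anything from the actual codebase; note it also shifts pitch, whereas a phase-vocoder time stretch such as librosa’s would preserve it):

```python
import numpy as np

def change_speed(wav: np.ndarray, rate: float) -> np.ndarray:
    """Resample the waveform by linear interpolation.

    rate > 1.0 speeds speech up, rate < 1.0 slows it down.
    Plain resampling also shifts pitch; a phase-vocoder time
    stretch (e.g. librosa.effects.time_stretch) keeps pitch intact.
    """
    n_out = int(len(wav) / rate)
    # positions in the original signal to sample at
    positions = np.linspace(0, len(wav) - 1, n_out)
    return np.interp(positions, np.arange(len(wav)), wav)

# slow a 1-second 22050 Hz clip down by 10%
wav = np.random.randn(22050).astype(np.float32)
slower = change_speed(wav, rate=0.9)
```

The appeal of doing it here is that the rate becomes a free parameter at synthesis time, with no retraining needed.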

I’d also agree that enabling punctuation makes training harder, but it’s good for getting the prosody right. One option could be replacing all punctuation marks with a single symbol.
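That replacement could be as simple as a regex pass over the input text before cleaning (the mark set and the `@` symbol here are illustrative; pick whatever fits your cleaners and character set):

```python
import re

# hypothetical set of pause-like marks to collapse; adjust as needed
PUNCT_PATTERN = re.compile(r"[,;:()\-]")

def collapse_punctuation(text: str, symbol: str = "@") -> str:
    """Map every pause-like punctuation mark to one shared symbol,
    so the model only has to learn a single 'break' token."""
    return PUNCT_PATTERN.sub(symbol, text)

print(collapse_punctuation("Let's eat, grandma (quickly); now."))
# sentence-final '.' is left untouched
```

Sentence-final marks could be kept distinct, since a full stop and a comma carry different intonation.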

Yes, good idea on the single-symbol point. I can see brackets and commas around subclauses having a similar effect, so this sounds promising.

The model with Forward Attention + BN does quite well with punctuation when it marks a phrase break or a sentence end. Case in point: I was experimenting with sentence length, because long sentences are obviously something TTS has problems with, and cutting the text into smaller sentences helps. The problem then is the intonation; however, when given a comma instead of a full stop, the TTS does great and I don’t even remember where I sliced the sentence.

PRE_Reading_news.wav.zip (500.6 KB)


Thanks @georroussos. Have you done anything special to the code? And I assume you’re still using the phonemizer? (I haven’t been able to listen to your samples yet.)

The reason I ask is that, as I understand it, the model doesn’t actually see the punctuation, because it isn’t passed on through the phonemizer stage. So in the cases where it does well with a comma, I believe it’s inferring the sentence structure from the words alone. With the default setup, therefore, you can’t control for cases where a comma makes a difference, such as this well-known pair of similar sentences with distinct meanings:

  1. Let’s eat, grandma
  2. Let’s eat grandma
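A quick sketch of why these two can’t be distinguished once punctuation is dropped (the stripping function here is a simplification for illustration, not the actual cleaner code):

```python
import re

def strip_punctuation(text: str) -> str:
    """Mimic a cleaner that drops punctuation before phonemization
    (a simplification of what the default pipeline does)."""
    return re.sub(r"[^\w\s']", "", text)

a = strip_punctuation("Let's eat, grandma")
b = strip_punctuation("Let's eat grandma")
assert a == b  # the model receives identical input for both
```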


I am using the phonemizer, yes! I did not do away with it. I remember trying the same sentence with a full stop instead of a comma, and the intonation was indeed different. I wonder how we can control this. I would guess a very good, professionally recorded dataset that is 100% accurately transcribed would definitely help.

The example sounds good. Is this a pre-trained model, or did you train from scratch?

The full stop ends the sentence, so in that case the model is working on a shorter sentence, which is why you get the different intonation. The version with the comma would, I believe, behave the same as if there were no comma there at all, because commas are stripped before the model gets to see the text. So before a precisely transcribed dataset can add value, the code would need tweaking to pass the commas through in some manner.

The experiment I mentioned above to work with punctuation used a “marker” symbol that I wrap around the desired punctuation character(s); the marker gets passed through espeak, and the punctuation can then be added back to the output phoneme text. I’ll look at writing it up in more detail and sharing the code.
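In outline, the marker idea could look something like this (the function names and the `§` marker are illustrative, not the actual implementation; the marker must be a character your phonemizer passes through unchanged):

```python
import re

MARKER = "§"  # assumed to survive the phonemizer untouched

def protect_punct(text: str, marks: str = ",") -> str:
    """Wrap each punctuation mark we want to keep, so espeak
    does not strip it during phonemization."""
    return re.sub(f"([{re.escape(marks)}])", f"{MARKER}\\1{MARKER}", text)

def restore_punct(phonemes: str) -> str:
    """Remove the markers from the phonemizer output, leaving the
    original punctuation embedded in the phoneme string."""
    return phonemes.replace(MARKER, "")

wrapped = protect_punct("Let's eat, grandma")
# ... phonemize `wrapped` here ...
restored = restore_punct(wrapped)
```

The round trip is lossless on the text side; the open question is how the model’s character set and the phonemizer treat the marker in practice.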