Sentences which trigger an endless loop

Hi all,

I’ve discovered two sentences which seem to make the model loop endlessly. I’ve been using https://github.com/synesthesiam/docker-mozillatts/ so I don’t have a simpler Python MWE at hand, sorry. We had some discussion in an issue as well.

The two sentences (so far) are

you whiskey. uWSGI

Which loops endlessly on “GI gi gi gi gi gi gi” at the end, and

Some important terms you should know.

Which loops on the “know” at the end (giving the odd effect of the computer telling you no no no), making me feel rather bad for asking the TTS to generate these sounds.

Audio Samples

I’ve noticed that with these sentences, making any change avoids the issue. In my project I’ve worked around it temporarily by simply appending a ‘.’ to the end and re-generating the audio if it’s longer than 20 seconds.

Is there anything I should be doing differently on my end to avoid this? Model parameters, perhaps?
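For reference, a rough sketch of that workaround (the /api/tts endpoint and port are what my docker-mozillatts setup exposes, and the 20-second threshold is just what worked for me; adjust both for your install):

```python
# Sketch of the 20-second retry workaround. The URL assumes a default
# docker-mozillatts setup; the endpoint/port may differ in yours.
import io
import wave

import requests

TTS_URL = "http://localhost:5002/api/tts"
MAX_SECONDS = 20  # anything longer is almost certainly a runaway decoder


def wav_seconds(data: bytes) -> float:
    """Duration of a WAV payload in seconds."""
    with wave.open(io.BytesIO(data)) as wav:
        return wav.getnframes() / wav.getframerate()


def synthesize(text: str) -> bytes:
    resp = requests.get(TTS_URL, params={"text": text}, timeout=60)
    resp.raise_for_status()
    return resp.content


def synthesize_safely(text: str) -> bytes:
    audio = synthesize(text)
    if wav_seconds(audio) > MAX_SECONDS:
        # Likely an endless loop: append a '.' and regenerate once.
        audio = synthesize(text + ".")
    return audio
```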

Thanks for bringing that up.

Do you get the max_decoder_steps message during inference?

I haven’t had that sort of problem when I use full stops. Have you tried using two, like “. .”?

@synesthesiam did you encounter that before with your models?

Yes, precisely, it hit max steps.

I’ve triggered it sometimes with and sometimes without full stops at the end. Two samples are without, but the “Some things…” one already includes a period at the end.

Removing the period from that sentence, or adding a second one, definitely works as a workaround though.

As training is usually done with the stops, the algorithm expects a full stop (or question mark) to finish a sentence. So it is fine if it fails without them. You could also define a typical length per character, then rerun with stops or throw an error if you exceed some threshold.
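Something along these lines (the per-character ratio and the margin are illustrative guesses; you’d want to calibrate them against your own outputs):

```python
# Sketch of a per-character length check; 0.12 s/char and the 2x margin
# are made-up starting points, not measured values.
TYPICAL_SECONDS_PER_CHAR = 0.12
MARGIN = 2.0


def ensure_terminated(text: str) -> str:
    """Append a full stop if the sentence lacks terminating punctuation."""
    stripped = text.rstrip()
    return stripped if stripped.endswith((".", "!", "?")) else stripped + "."


def looks_like_runaway(text: str, audio_seconds: float) -> bool:
    """Flag audio that is far longer than the text should plausibly take."""
    expected = len(text) * TYPICAL_SECONDS_PER_CHAR
    return audio_seconds > expected * MARGIN
```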

Hi @othiele, maybe I wasn’t clear in my phrasing or posting, sorry if that was the case:

  • My issue is more about the fact that it happens even with a full stop in at least one case. The sentence Some important terms you should know. (including the final full stop) triggers this behaviour too.
  • Two full stops should not be required.
  • I think this highlights an error in the model or its stop detection, and I simply wished to report it upstream.

I appreciate your suggestions and willingness to provide workarounds, but I had already implemented all of them before posting. I described that in my original post:

I’ve worked around it temporarily by simply appending a ‘.’ to the end and re-generating the audio if it’s longer than 20 seconds.

The models we release are not perfect, since they rely on open, non-professional datasets. Thus, it is normal for them to fail on some sentences.

If you’d like to debug it in more detail, let me know; I can provide some guidance, but it requires technical knowledge.

Sure, very understandable. If this is an expected failure mode, then this post can be considered resolved. Just wanted to share in case the issue was not known.

Thanks everyone!

Yes, all of my models have issues if you leave off the full stop. I’m considering automatically adding it behind the scenes if it’s missing.

What’s unusual here is that the sentence already contains the full stop. I would have guessed that this was triggering some abbreviation handling, but “know.” certainly isn’t in the list :laughing:

I’m happy to help with debugging, @erogol, since I have the technical expertise. Do you see any downside to just appending a “.” to every sentence internally as a workaround?

I’d happily join this discussion, as we’ve encountered these “max_decoder_steps” warnings in models based on the “thorsten” dataset too. All phrases in the recorded dataset end with “.”, “?”, or “!”, and this behaviour happens even when the synthesized sentence ends with a dot.

I have a feeling there was code on the training side that already applied a full stop, but it’s been a while since I looked and I’m AFK right now. I also seem to recall running into issues where two full stops were used, but I think that got resolved with the switch to the sentence-splitter code.

I don’t see a huge issue with adding a full stop. It’s probably worth skipping it if the sentence already ends in final punctuation (i.e. ! or ?), and maybe replacing the ending if it’s , or ; or : and perhaps a few others.
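Roughly something like this (the exact character sets are a first guess and would need refining):

```python
# Sketch of the ending normalisation described above; the punctuation
# sets here are guesses, not what any release actually ships with.
FINAL_PUNCTUATION = (".", "!", "?")
REPLACEABLE_PUNCTUATION = (",", ";", ":")


def normalize_ending(sentence: str) -> str:
    s = sentence.rstrip()
    if s.endswith(FINAL_PUNCTUATION):
        return s  # already properly terminated, leave it alone
    if s.endswith(REPLACEABLE_PUNCTUATION):
        return s[:-1] + "."  # swap a trailing , ; : for a full stop
    return s + "."  # otherwise just append one
```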

Yes, adding “.” would work. You can also make use of the attention alignment: when the attention reaches the end of the input, you can stop the decoder. It’s a more reliable solution than the stopnet. Hope it makes sense.
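For illustration, a toy version of that check (all the names here are hypothetical, not actual Mozilla TTS internals):

```python
# Toy attention-based stopping criterion: end decoding once the attention
# argmax has rested on the final encoder step for a few consecutive frames.
# `alignment` is assumed to be the (input_length,) attention vector the
# decoder produces for the current frame; names are hypothetical.
import torch


def attention_reached_end(alignment: torch.Tensor,
                          input_length: int,
                          focus_history: list,
                          patience: int = 5) -> bool:
    focus_history.append(int(alignment.argmax().item()))
    recent = focus_history[-patience:]
    return (len(recent) == patience
            and all(pos >= input_length - 1 for pos in recent))
```

Unlike the stopnet, this doesn’t depend on a learned stop token, so it can’t get stuck repeating the last syllable once the attention has already consumed the whole input.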
