Um's and Ah's - how do we handle speech disfluency?

This question always comes up: how do we handle “um”, “ah”, and other speech disfluencies? If your audio data contains them, should the text also contain them? Is there a standard here, or is it one of those “it depends” situations?

The problem arises when we’re transcribing real-world interviews. We usually don’t transcribe these sounds because it doesn’t make sense to show them to the reader; however, we’re also using these transcriptions as the training set for DeepSpeech, and we’re not sure whether we should transcribe every sound that is spoken or just the words we care about.

Taking the problem further, what if you wanted to train a model to do captioning? You might want to include something like “[laughter]” or “[ominous music playing in the background]”. Currently we don’t transcribe these events, so the model learns to ignore them. By the same logic, should we leave speech disfluencies out of the transcripts so the model learns to ignore those too, if that’s what we want?
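For concreteness, the kind of switch we’re debating looks something like this minimal Python sketch (the marker sets and function name are placeholders for illustration, not anything we actually run):

```python
import re

# Illustrative marker sets -- adjust to whatever your corpus actually uses.
DISFLUENCIES = {"um", "uh", "ah", "er"}
NON_SPEECH_TAG = re.compile(r"\[[^\]]+\]")  # e.g. [laughter], [music]

def normalize_transcript(text, keep_disfluencies=True, keep_tags=False):
    """Clean one transcript line before acoustic-model training."""
    if not keep_tags:
        # Drop bracketed non-speech events like [laughter].
        text = NON_SPEECH_TAG.sub(" ", text)
    tokens = text.lower().split()
    if not keep_disfluencies:
        # Drop filler tokens, ignoring trailing punctuation.
        tokens = [t for t in tokens if t.strip(",.?!") not in DISFLUENCIES]
    return " ".join(tokens)

print(normalize_transcript("Um, I think [laughter] we should, uh, start over",
                           keep_disfluencies=False))
# -> "i think we should, start over"
```

The question is really which combination of those two flags produces the behaviour we want the model to learn.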

Keen to hear opinions, see any research that people have done on this, and/or hear about how you handle this problem in your own work. Mahalo!


Great post @keoni! We currently transcribe German “um”s (usually “ähm”) literally, and we also transcribe laughter, since we have a lot of TV material where people laugh a bit artificially. None of this ends up in the final transcripts, though, because our language model doesn’t include those tokens for now.
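Roughly, the filtering step before language-model training looks like this sketch (file names and the token list are simplified placeholders, not our actual pipeline):

```python
# Tokens we transcribe for the acoustic model but keep out of the LM corpus.
DROP_TOKENS = {"ähm", "äh", "[lachen]", "[laughter]"}

def lm_corpus_line(transcript_line):
    """Strip filler/non-speech markers from one transcript line."""
    kept = [tok for tok in transcript_line.split()
            if tok.lower() not in DROP_TOKENS]
    return " ".join(kept)

with open("transcripts.txt", encoding="utf-8") as src, \
     open("lm_corpus.txt", "w", encoding="utf-8") as dst:
    for line in src:
        cleaned = lm_corpus_line(line.strip())
        if cleaned:  # skip lines that were only fillers
            dst.write(cleaned + "\n")
```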

It would be great to know how others handle this.

Does your acoustic training data include the "um"s, or do you clean those out before using for training?

We leave them in the acoustic training data and transcribe them manually as “ähm” etc., since they occur in the material we want to run inference on. At inference time we filter them out of the output, but we keep statistics on them.
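Something along these lines, as a simplified sketch (the filler list is illustrative):

```python
from collections import Counter

FILLERS = {"ähm", "äh"}  # illustrative filler set

def clean_hypothesis(hyp, stats):
    """Drop filler tokens from a decoded hypothesis, counting what we drop."""
    kept = []
    for tok in hyp.split():
        if tok in FILLERS:
            stats[tok] += 1
        else:
            kept.append(tok)
    return " ".join(kept)

stats = Counter()
print(clean_hypothesis("ähm ich glaube ähm das passt", stats))
# -> "ich glaube das passt"; stats == Counter({"ähm": 2})
```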
