Um's and Ah's - how do we handle speech disfluency?

This question always comes up: how do we handle “um”, “ah”, and other speech disfluencies? If your audio data contains them, should the text also contain them? Is there a standard here, or is it one of those “it depends” situations?

The problem arises when we’re transcribing real-world interviews. We usually don’t transcribe these sounds because it doesn’t make sense to show them to the reader; however, we’re also using these transcriptions as the training set for DeepSpeech, and we’re not sure whether we should transcribe every sound that is spoken or just the words we care about.

Taking the problem further, what if you wanted to train a model to do captioning? You might want to include something like “[laughter]” or “[ominous music playing in the background]”. Currently we don’t transcribe these events, so the model learns to ignore them. By the same logic, should we leave speech disfluencies out of the transcripts so the model learns to ignore those too, if that’s what we want?
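For concreteness, the kind of switch we’re debating looks something like this minimal Python sketch (the marker sets and function name are placeholders for illustration, not anything we actually run):

```python
import re

# Illustrative marker sets -- adjust to whatever your corpus actually uses.
DISFLUENCIES = {"um", "uh", "ah", "er"}
NON_SPEECH_TAG = re.compile(r"\[[^\]]+\]")  # e.g. [laughter], [music]

def normalize_transcript(text, keep_disfluencies=True, keep_tags=False):
    """Clean one transcript line before acoustic-model training."""
    if not keep_tags:
        # Drop bracketed non-speech events like [laughter].
        text = NON_SPEECH_TAG.sub(" ", text)
    tokens = text.lower().split()
    if not keep_disfluencies:
        # Drop filler tokens, ignoring trailing punctuation.
        tokens = [t for t in tokens if t.strip(",.?!") not in DISFLUENCIES]
    return " ".join(tokens)

print(normalize_transcript("Um, I think [laughter] we should, uh, start over",
                           keep_disfluencies=False))
# -> "i think we should, start over"
```

The question is really which combination of those two flags produces the behaviour we want the model to learn.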

Keen to hear opinions, see any research that people have done on this, and/or hear about how you handle this problem in your own work. Mahalo!


Great post @keoni! We currently transcribe German “um”s (usually “ähm”) literally, and we also transcribe laughter, since we have a lot of TV material where people laugh a bit artificially. None of this ends up in the final transcripts, though, because our language model doesn’t include those tokens for now.
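Roughly, the filtering step before language-model training looks like this sketch (file names and the token list are simplified placeholders, not our actual pipeline):

```python
# Tokens we transcribe for the acoustic model but keep out of the LM corpus.
DROP_TOKENS = {"ähm", "äh", "[lachen]", "[laughter]"}

def lm_corpus_line(transcript_line):
    """Strip filler/non-speech markers from one transcript line."""
    kept = [tok for tok in transcript_line.split()
            if tok.lower() not in DROP_TOKENS]
    return " ".join(kept)

with open("transcripts.txt", encoding="utf-8") as src, \
     open("lm_corpus.txt", "w", encoding="utf-8") as dst:
    for line in src:
        cleaned = lm_corpus_line(line.strip())
        if cleaned:  # skip lines that were only fillers
            dst.write(cleaned + "\n")
```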

It would be great to know how others handle this.

Does your acoustic training data include the "um"s, or do you clean those out before using for training?

We leave them in the acoustic training data and transcribe them manually as “ähm” etc., since they occur in the material we want to run inference on. At inference time we filter them out of the output, but we keep statistics on them.
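Something along these lines, as a simplified sketch (the filler list is illustrative):

```python
from collections import Counter

FILLERS = {"ähm", "äh"}  # illustrative filler set

def clean_hypothesis(hyp, stats):
    """Drop filler tokens from a decoded hypothesis, counting what we drop."""
    kept = []
    for tok in hyp.split():
        if tok in FILLERS:
            stats[tok] += 1
        else:
            kept.append(tok)
    return " ".join(kept)

stats = Counter()
print(clean_hypothesis("ähm ich glaube ähm das passt", stats))
# -> "ich glaube das passt"; stats == Counter({"ähm": 2})
```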
