How to train DeepSpeech that something ISN'T speech?

I’ve noticed that DeepSpeech will sometimes interpret instrumental music, sound effects and animal noises as speech. How can I tell it that something isn’t speech - would I just provide a blank transcript with the clip?

Or is it not necessary - will DeepSpeech get better at interpreting this as it gets more samples of what speech actually is?

We have done a bit of this, though not with the express purpose you are suggesting.

One of our data sets, Fisher, has lots of “ummms”, “ahhhs”, and “hmmms” in it, and we didn’t give DeepSpeech transcripts for those disfluencies. It did get the transcript of the surrounding fluent speech, though, so it tends not to transcribe such disfluencies.

So you could do the same: add background noise, such as music or animal noises, to your standard training data, keep the transcripts of the speech as they are, and then train or fine-tune on that data set.
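
A minimal sketch of that mixing step, in case it helps. It is not part of DeepSpeech; `mix_noise` and the `snr_db` parameter are hypothetical names, and it assumes both clips are already loaded as mono float arrays in [-1, 1] at the same sample rate. The transcript for the mixed clip stays the same as for the clean one.

```python
import numpy as np

def mix_noise(speech, noise, snr_db):
    """Overlay noise on speech at roughly the requested signal-to-noise ratio (dB).

    Hypothetical helper: assumes mono float samples at a shared sample rate.
    """
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    if speech.size == 0 or noise.size == 0:
        raise ValueError("speech and noise must be non-empty")

    # Loop the noise if it is shorter than the speech, then trim to length.
    reps = int(np.ceil(speech.size / noise.size))
    noise = np.tile(noise, reps)[: speech.size]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if noise_power == 0:
        return speech.copy()  # silent noise clip: nothing to add

    # Scale the noise so speech_power / scaled_noise_power == 10 ** (snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    mixed = speech + scale * noise

    # Rescale only if the sum would clip when written back as PCM.
    peak = np.max(np.abs(mixed))
    if peak > 1.0:
        mixed = mixed / peak
    return mixed

# Synthetic stand-ins for a speech clip and a music/animal-noise clip.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
speech = 0.1 * np.sin(2 * np.pi * 220.0 * t)
noise = rng.normal(0.0, 1.0, 4000)
augmented = mix_noise(speech, noise, snr_db=10.0)
```

Varying `snr_db` per clip (say, somewhere between quiet and fairly loud backgrounds) is one way to keep the model from latching onto a single noise level, but the right range is something to tune on your own data.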

Thanks, I’ll just extend the clip so it contains actual speech and transcribe that.