Training DeepSpeech on (near) silence?

When recording silence (or near silence), DeepSpeech starts to produce gibberish instead of an empty transcript. Would it be possible to add silence (or noise-only) samples to the dataset to prevent this?

In my simplified view, the algorithm will always try to attribute some audio to some letter. Since silence is usually just a very low-level acoustic signal, it would be hard to train it to map to any particular letter.

I would rather check the input with VAD and noise/signal levels to decide whether it should be sent to the recognizer at all.
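
Something like this would be my starting point (an untested sketch, assuming 16 kHz, 16-bit mono PCM and the py-webrtcvad package; the `energy_floor` and `min_ratio` thresholds are made up and would need tuning):

```python
# Sketch: gate audio with WebRTC VAD plus a crude energy check before
# handing it to DeepSpeech. Assumes 16 kHz, 16-bit mono PCM; WebRTC VAD
# only accepts frames of 10, 20 or 30 ms.
import numpy as np
import webrtcvad

SAMPLE_RATE = 16000
FRAME_MS = 30
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 2 bytes per 16-bit sample

def contains_speech(pcm_bytes, aggressiveness=2, energy_floor=100.0, min_ratio=0.1):
    """Return True if enough frames are voiced and above the energy floor."""
    vad = webrtcvad.Vad(aggressiveness)
    frames = [pcm_bytes[i:i + FRAME_BYTES]
              for i in range(0, len(pcm_bytes) - FRAME_BYTES + 1, FRAME_BYTES)]
    if not frames:
        return False
    voiced = 0
    for frame in frames:
        samples = np.frombuffer(frame, dtype=np.int16).astype(np.float32)
        rms = np.sqrt(np.mean(samples ** 2))  # rough noise/signal level
        if rms > energy_floor and vad.is_speech(frame, SAMPLE_RATE):
            voiced += 1
    return voiced / len(frames) >= min_ratio
```

Only audio that passes a gate like this would then be passed on to DeepSpeech.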

And play around with the confidence value in the metadata, which should flag bad transcriptions.
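
With the 0.9.x Python bindings that could look roughly like this (the confidence is roughly a sum of per-character scores, so it is negative and depends on the audio length; the `-20.0` threshold below is a placeholder you would have to tune on your own recordings):

```python
# Rough sketch: drop low-confidence results using DeepSpeech metadata.
# Assumes the 0.9.x Python bindings and a 16 kHz, 16-bit mono WAV file.
import wave
import numpy as np
from deepspeech import Model

model = Model("deepspeech-0.9.3-models.pbmm")

def transcribe_or_reject(wav_path, min_confidence=-20.0):
    with wave.open(wav_path, "rb") as w:
        audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    metadata = model.sttWithMetadata(audio, 1)  # 1 = only the best transcript
    best = metadata.transcripts[0]
    if best.confidence < min_confidence:
        return None  # likely silence or noise, discard
    return "".join(token.text for token in best.tokens)
```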

Thanks. Is there a way to get per-token metadata to filter out this sort of thing?

No, as far as I remember the confidence is currently only reported for the transcript as a whole. But there was somebody who wanted to do a PR to change that recently; search the forum. It is a bit harder to implement because it resides on the C++ side of things.