How does DeepSpeech discriminate between speech and music?

Say I’m recording a radio DJ, and the resulting audio file contains:
music → some music → speech/voice → music → speech → speech → speech → hold music → end of audio
(Speech and music are assumed to be non-overlapping, or only marginally overlapping.)

Q1. How do I ignore the non-speech parts and extract only the speech portions of the audio?
i.e. I want my final audio to contain only the speech portions.

Q2. How does DeepSpeech currently handle music when doing speech recognition?

Q3. Is there any pre-trained model for detecting the onset of speech, or the speech portions, in an audio file?

  1. Use a Voice Activity Detection (VAD) tool.

  2. It doesn’t. Transcription results for music will not make sense.

  3. There are several VAD tools available. The WebRTC project has one, for example. There’s a topic here on Discourse that mentions other tools, but I don’t remember where it is.

Some VAD tools are also mentioned in this topic: Longer audio files with Deep Speech. A minimal usage sketch follows below.
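
To illustrate the VAD approach for Q1/Q3, here is a minimal sketch using the py-webrtcvad Python bindings to the WebRTC VAD. The file names, the 16 kHz mono 16-bit PCM input format, the 30 ms frame size, and the aggressiveness setting are all assumptions for the example, not anything prescribed by DeepSpeech:

```python
# Minimal sketch: keep only the frames WebRTC VAD classifies as speech.
# Assumes a 16 kHz, mono, 16-bit PCM WAV input (convert with sox/ffmpeg first).
import wave
import webrtcvad  # pip install webrtcvad

FRAME_MS = 30          # WebRTC VAD accepts 10, 20 or 30 ms frames
SAMPLE_RATE = 16000
BYTES_PER_FRAME = int(SAMPLE_RATE * FRAME_MS / 1000) * 2  # 16-bit samples

vad = webrtcvad.Vad(2)  # aggressiveness 0 (least) .. 3 (most aggressive)

with wave.open("dj_recording.wav", "rb") as wf:
    assert wf.getframerate() == SAMPLE_RATE and wf.getnchannels() == 1
    pcm = wf.readframes(wf.getnframes())

# Collect the frames that the VAD flags as speech.
speech = bytearray()
for offset in range(0, len(pcm) - BYTES_PER_FRAME + 1, BYTES_PER_FRAME):
    frame = pcm[offset:offset + BYTES_PER_FRAME]
    if vad.is_speech(frame, SAMPLE_RATE):
        speech.extend(frame)

# Write the speech-only audio, ready to feed to DeepSpeech.
with wave.open("speech_only.wav", "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(SAMPLE_RATE)
    out.writeframes(bytes(speech))
```

Note that the WebRTC VAD is tuned to separate speech from silence and noise rather than from music, so some music segments may still slip through; the segments it keeps can then be passed to DeepSpeech for transcription.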