How does DeepSpeech discriminate between speech and music?

Say I’m recording a radio DJ, and the resulting audio file contains:
music → some music → speech/voice → music → speech → speech → speech → hold music → end of audio
(Speech and music are assumed to be non-overlapping, or only marginally overlapping.)

Q1. How do I ignore the non-speech parts and extract only the speech portions of the audio?
i.e. I want my final audio to contain only the speech portions.

Q2. How does DeepSpeech currently handle music when doing speech recognition?

Q3. Is there any pre-trained model for detecting the onset of speech, or the speech portions, in an audio file?

  1. Use a Voice Activity Detection (VAD) tool.

  2. It doesn’t. Transcription results for music will not make sense.

  3. There are several VAD tools available. The WebRTC project has one, for example. There’s a topic here on Discourse that mentions other tools, but I don’t remember where it is.

Some VAD tools are also mentioned in this topic: Longer audio files with Deep Speech. A minimal usage sketch follows below.
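
To illustrate the VAD approach for Q1/Q3, here is a minimal sketch using the py-webrtcvad Python bindings to the WebRTC VAD. The file names, the 16 kHz mono 16-bit PCM input format, the 30 ms frame size, and the aggressiveness setting are all assumptions for the example, not anything prescribed by DeepSpeech:

```python
# Minimal sketch: keep only the frames WebRTC VAD classifies as speech.
# Assumes a 16 kHz, mono, 16-bit PCM WAV input (convert with sox/ffmpeg first).
import wave
import webrtcvad  # pip install webrtcvad

FRAME_MS = 30          # WebRTC VAD accepts 10, 20 or 30 ms frames
SAMPLE_RATE = 16000
BYTES_PER_FRAME = int(SAMPLE_RATE * FRAME_MS / 1000) * 2  # 16-bit samples

vad = webrtcvad.Vad(2)  # aggressiveness 0 (least) .. 3 (most aggressive)

with wave.open("dj_recording.wav", "rb") as wf:
    assert wf.getframerate() == SAMPLE_RATE and wf.getnchannels() == 1
    pcm = wf.readframes(wf.getnframes())

# Collect the frames that the VAD flags as speech.
speech = bytearray()
for offset in range(0, len(pcm) - BYTES_PER_FRAME + 1, BYTES_PER_FRAME):
    frame = pcm[offset:offset + BYTES_PER_FRAME]
    if vad.is_speech(frame, SAMPLE_RATE):
        speech.extend(frame)

# Write the speech-only audio, ready to feed to DeepSpeech.
with wave.open("speech_only.wav", "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(SAMPLE_RATE)
    out.writeframes(bytes(speech))
```

Note that the WebRTC VAD is tuned to separate speech from silence and noise rather than from music, so some music segments may still slip through; the segments it keeps can then be passed to DeepSpeech for transcription.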