Say I’m recording a radio DJ, and the resulting audio file contains the following sequence:
music -> some music -> speech/voice -> music -> speech -> speech -> speech -> hold music -> end of audio
(Here, speech and music are assumed to be non-overlapping or only marginally overlapping.)
Q1. How do I ignore the non-speech parts and extract only the speech portions of the audio?
i.e., I want my final audio to contain only the speech portions.
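To make the goal in Q1 concrete, here is a rough sketch of the output I'm after. It assumes the speech boundaries are already known, which is exactly the part I don't know how to obtain. The function name `keep_speech` and the `(start_sec, end_sec, label)` segment format are made up for illustration and are not part of any library.

```python
def keep_speech(samples, sample_rate, segments):
    """Return only the samples that fall inside segments labeled 'speech'.

    samples:     sequence of mono audio samples
    sample_rate: samples per second
    segments:    list of (start_sec, end_sec, label) tuples (hypothetical format)
    """
    out = []
    for start_sec, end_sec, label in segments:
        if label != "speech":
            continue  # skip music, hold music, etc.
        start = int(round(start_sec * sample_rate))
        end = int(round(end_sec * sample_rate))
        out.extend(samples[start:end])
    return out


# Toy example: 10 "samples" at 1 sample per second.
samples = list(range(10))
segments = [
    (0, 3, "music"),
    (3, 5, "speech"),
    (5, 8, "music"),
    (8, 10, "speech"),
]
speech_only = keep_speech(samples, 1, segments)
```

With real audio, `samples` would be the decoded waveform and `segments` would have to come from some speech/music detector, which is what Q3 asks about.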
Q2. How does DeepSpeech currently handle music when doing speech recognition?
Q3. Is there any pre-trained model for detecting the onset of speech, or the speech portions, in an audio file?