I have a bunch of audio files, each starting with a ringing tone (“tring tring”) or hold music, followed by the actual conversation.
What is the best way to detect and remove the hold music and the ringing from each audio file, so that I end up with clean files containing just the conversation?
The goal is to get better speech-to-text transcription.
             
Hello,
Well, in fact, you should start with clean voices, and then add modified copies of them, with noise added and other distortions (using voice-corpus-tool).
- Example: you live near a noisy road.
 You record a lot of clean voices,
 you record many samples of the road noise (e.g. 10 s each),
 then you duplicate those voices and mix this noise into the copies (see augment in voice-corpus-tool).
Finally, you obtain a model that works in your own environment.
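To make the "duplicate the voices and mix noise in" step concrete, here is a minimal sketch in Python with NumPy. It is not how voice-corpus-tool does it internally; the function name `mix_noise` and the `snr_db` parameter are my own assumptions for illustration, and the audio is assumed to be already loaded as mono float arrays at the same sample rate.

```python
import numpy as np

def mix_noise(voice, noise, snr_db):
    """Return a copy of `voice` with `noise` mixed in at about `snr_db` dB SNR.

    Hypothetical helper for illustration; both inputs are 1-D sample arrays.
    """
    voice = np.asarray(voice, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Loop the short noise clip so it covers the whole utterance, then cut to length.
    reps = int(np.ceil(len(voice) / len(noise)))
    noise = np.tile(noise, reps)[: len(voice)]

    voice_power = np.mean(voice ** 2)
    noise_power = np.mean(noise ** 2)
    if noise_power == 0:
        # Silent noise clip: nothing to add.
        return voice.copy()

    # Choose the gain so that voice_power / (gain^2 * noise_power) == 10^(snr_db / 10).
    gain = np.sqrt(voice_power / (noise_power * 10 ** (snr_db / 10)))
    return voice + gain * noise
```

You would keep the clean file in the training set and add one or more noisy copies next to it, for example at a few different `snr_db` values.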
- Now, about cutting your existing recordings: have a look at the ‘silence’ effect of SoX
 (but it’s not miraculous with noisy voice: sox will not only remove noise, it will surely remove important parts of your voice too… Have a try.)
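If you want to try the SoX silence effect over many files, a small Python wrapper can help. This is only a sketch: the function names, the file names and the default duration and threshold are my assumptions and will need tuning per recording. Note also that the effect trims by loudness, so it cuts a quiet lead-in, while a ringing tone or hold music that is louder than the threshold will most likely stay in the file.

```python
import subprocess

def sox_trim_leading_command(src, dst, min_duration=0.3, threshold="1%"):
    """Build a sox command that trims the quiet lead-in of `src` into `dst`.

    `silence 1 <duration> <threshold>` drops audio from the start until sound
    above `threshold` has lasted `min_duration` seconds.
    The default values are starting guesses, not recommendations.
    """
    return ["sox", src, dst, "silence", "1", str(min_duration), threshold]

def trim_leading(src, dst, **kwargs):
    """Run the command; requires the sox binary to be installed and on PATH."""
    subprocess.run(sox_trim_leading_command(src, dst, **kwargs), check=True)
```

For example, `trim_leading("call_001.wav", "call_001_trimmed.wav", threshold="2%")` (hypothetical file names), then listen to a few results before processing the whole batch.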