Just wanted to find out how bad it would be to use records in which more than one speaker is talking.
Let me explain a bit more. I extract speech data from YouTube based on the manually provided subtitles. To collect as much data as possible in a short time, I do almost no post-processing. Music, noise, and other acoustic effects are kept in; I'm guessing (and hoping) this will lead to a more robust model. Am I wrong?
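For context, the extraction step is essentially "cut the audio at the subtitle timestamps." A minimal sketch of that (the regex, helper names, and sample SRT below are my own illustration, not from any particular library):

```python
import re

# SRT-style timestamp line: "HH:MM:SS,mmm --> HH:MM:SS,mmm"
TIME_RE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2}),(\d{3})"
)

def to_ms(h, m, s, ms):
    # Convert hours/minutes/seconds/millis strings to total milliseconds.
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)

def cue_spans(srt_text):
    """Return (start_ms, end_ms) for every subtitle cue in the file.

    Each span becomes one training record when the audio is sliced,
    regardless of how many speakers actually talk inside it.
    """
    spans = []
    for match in TIME_RE.finditer(srt_text):
        g = match.groups()
        spans.append((to_ms(*g[:4]), to_ms(*g[4:])))
    return spans

srt = """1
00:00:01,000 --> 00:00:03,500
Hello there.

2
00:00:03,600 --> 00:00:07,250
Two people may be talking in this span.
"""
print(cue_spans(srt))  # [(1000, 3500), (3600, 7250)]
```

Since the cue boundaries come only from the subtitle file, nothing in this pipeline knows whether one or several voices fall inside a span, which is exactly the situation my question is about.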
And since the subtitles carry no speaker information (who spoke when) and I leave them as they are, quite often multiple people's speech ends up in a single record. How bad is that (if it's bad at all)? After all, I'm going to add this data to a clean 300-hour dataset (one speaker per record).
Thank you all for the suggestions!