I’m wondering what the effects of having a dataset where the spoken phrases have uniform timing and pitch throughout. For example, if a reference pitch was played at 110 beats per minute, every syllable would land on a beat and be voiced at the particular reference pitch. I wonder how this would effect training, would it speed it up, or would the organically robotic/monotone style of dataset confuse the model from little variation?
It wouldn’t confuse it, it would probably learn it. Then if you wanted to train using GST layers, it might not be a versatile model, but in general it shouldn’t be a problem.
good to hear, thanks. I will add more variations later, so wanted to see if anyone thought there might be issues with this at the basic level.