I have just completed my first tiny run of 300 training steps. My output was 2.5 minutes of bleeps and bloops! After the initial relief that at least the audio wasn’t silent, I was wondering: is it normal for a sentence like ‘He is your’ to result in such long audio, even for a network that has barely started training?
If you have any thoughts, let me know!
Sounds like it’s a few thousand steps away from alignment.
Try training for another ~10,000 steps and you should see the first results.
Thank you for your quick reply, this gives me something to aim for.
Thank you, this also helps with my terminology!
300 steps is one epoch (out of 1000). Sounds like you have been waiting for a long time. Are you training on a CPU, by any chance?
Hey there, I believe I was looking at the global steps at the time, not epochs. I am currently altering my batch size to make those pass faster.
I am running a GTX 1060, btw.
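For anyone else mixing up global steps and epochs as discussed above, here is a minimal sketch of how the two usually relate. It assumes one step means one batch, and the dataset size of 9,600 clips is a made-up number chosen only so that a batch size of 32 gives the 300 steps per epoch mentioned in this thread. The function names are hypothetical and not taken from any particular TTS toolkit.

```python
import math


def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of batches (steps) needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


def epochs_completed(global_step: int, num_samples: int, batch_size: int) -> float:
    """Convert a global step count into (fractional) epochs."""
    return global_step / steps_per_epoch(num_samples, batch_size)


# Hypothetical dataset size, picked for illustration only.
num_samples = 9600

for batch_size in (16, 32, 64):
    spe = steps_per_epoch(num_samples, batch_size)
    done = epochs_completed(300, num_samples, batch_size)
    print(f"batch_size={batch_size}: {spe} steps/epoch, "
          f"300 global steps = {done:.2f} epochs")
```

Note that a larger batch size lowers the number of steps per epoch, but each step processes more data, so an epoch will not necessarily finish faster in wall-clock time, and a 6 GB card may run out of memory before that becomes a question.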