From my reading of several research papers, 10,000 hours of audio seems to be the benchmark for deep speech models to generalize well in real-world conditions.
Is there a similar number indicating how many speakers the dataset should contain so that the model generalizes well for most people? Should we have 10k, 15k, 20k or more speakers in the dataset?
There is no definitive answer for that. Ideally, your training material is similar to what you want to recognize: if you expect female speakers, include female training data. I would argue you can get good results with 100+ speakers who speak similarly to the target input. Matching the input's characteristics (speaking rate, dialect, …) matters more than sheer speaker count. Also, record in good quality — you can always degrade clean data later, but you cannot recover quality that was never captured.
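The "you can always degrade it later" point is a common augmentation trick: mix noise into clean recordings at a chosen signal-to-noise ratio to simulate worse acoustic conditions. A minimal sketch with NumPy (the function name `add_noise` and the sine-tone example are illustrative, not from any specific toolkit):

```python
import numpy as np

def add_noise(signal: np.ndarray, snr_db: float, rng=None) -> np.ndarray:
    """Degrade a clean waveform by mixing in Gaussian noise at a target SNR (dB)."""
    rng = np.random.default_rng(0) if rng is None else rng
    signal_power = np.mean(signal ** 2)
    # Solve SNR_dB = 10 * log10(signal_power / noise_power) for noise_power.
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

# Example: a 1 kHz sine tone sampled at 16 kHz, degraded to 10 dB SNR.
sr = 16000
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 1000.0 * t)
noisy = add_noise(clean, snr_db=10.0)
```

Running the same clean data through several SNR levels (and speed or reverberation perturbations) multiplies the effective variety of your training set without new recordings.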