Hello,
Quick question here.
The accuracy of the pretrained models is decent with a custom language model, but it isn’t near the accuracy of services such as Google or Amazon.
I am aware that Google and Amazon use much, much more data to train their models than the DeepSpeech pretrained models were trained on.
Is the main thing holding back DeepSpeech the quantity of available data?
If DeepSpeech were trained on, say, 1 million hours of data, could its performance theoretically approach Google’s accuracy?
1 million hours of data wouldn’t be enough if you test your model on data that is nothing like your training data!
Much of the data out there is clean speech rather than recordings from conversational environments: usually a person reading a book. If you test those models with your own voice reading a book, you might be surprised at how good they are. If instead you collect data from noisy, conversational environments, you should need fewer than a thousand hours to get below 10% WER in those conditions.
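For reference, WER (word error rate) is the number of word substitutions, insertions, and deletions needed to turn the recognizer’s output into the reference transcript, divided by the number of words in the reference. A minimal Python sketch (plain word-level edit distance, no external libraries; the example sentences are made up):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein distance, computed over words instead of characters.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost) # substitution
    return dist[len(ref)][len(hyp)] / len(ref)

# Made-up example: 1 wrong word out of 10 reference words -> 10% WER.
print(wer("the quick brown fox jumps over the lazy dog today",
          "the quick brown fox jumped over the lazy dog today"))  # 0.1
```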
Thanks for your reply.
That’s a good point.
Assuming the training data is similar to the data used for testing, then in theory, the greater the quantity and diversity of the training data, the better the performance. Right?
Yes, naturally. Once you have the data, all that is left is to fine-tune your hyperparameters and maybe the geometry of your model. For those real-world environments you’ll need even more training data, because there are many different kinds of noise that can affect performance.
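As a rough illustration of what “fine-tuning hyperparameters and geometry” means in practice, here is a sketch that launches the Mozilla DeepSpeech training script with a few of the usual knobs. The flag names follow DeepSpeech.py around v0.6/v0.7 and the paths and values are placeholders, so check your version’s --helpfull output before relying on them:

```python
import subprocess

# Placeholder values -- tune these for your own data and hardware.
hparams = {
    "--train_files": "data/train.csv",   # hypothetical CSV paths
    "--dev_files": "data/dev.csv",
    "--test_files": "data/test.csv",
    "--n_hidden": "2048",                # "geometry": width of the hidden layers
    "--learning_rate": "0.0001",
    "--dropout_rate": "0.15",
    "--epochs": "30",
    "--train_batch_size": "24",
    "--checkpoint_dir": "checkpoints/",
}

# Flag names are assumed from Mozilla DeepSpeech's DeepSpeech.py (circa v0.6/v0.7);
# verify them against your installed version.
cmd = ["python3", "DeepSpeech.py"]
for flag, value in hparams.items():
    cmd += [flag, value]

subprocess.run(cmd, check=True)
```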
Also, I don’t think Google’s models are that good. I tested a few services and found that Google did worse than IBM Watson and Amazon. Then I trained my own model with 350 hours of speech prepared for my test conditions and got better results than all three.
Interesting, thanks!