Hello,
Quick question here.
The accuracy of the pretrained models is decent with a custom language model, but it isn’t near the accuracy of services such as Google or Amazon.
I am aware that Google and Amazon use much, much more data to train their models than the DeepSpeech pretrained models were trained on.
Is the main thing holding back DeepSpeech the quantity of available data?
If DeepSpeech were trained on, say, 1 million hours of data, could its performance theoretically approach Google’s accuracy?
1 million hours of data wouldn’t be enough if you test your model on data that is nothing like your training data!
Much of the data out there is clean speech rather than recordings from conversational environments: usually a person reading a book. If you test those models with your own voice reading a book, you might be surprised at how good they are. If instead you collect data from noisy, conversational environments, you should need fewer than a thousand hours to get below 10% WER in those conditions.
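For reference, WER (word error rate) is the number of word substitutions, insertions, and deletions needed to turn the recognizer’s output into the reference transcript, divided by the number of words in the reference. A minimal Python sketch (plain word-level edit distance, no external libraries; the example sentences are made up):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein distance, computed over words instead of characters.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost) # substitution
    return dist[len(ref)][len(hyp)] / len(ref)

# Made-up example: 1 wrong word out of 10 reference words -> 10% WER.
print(wer("the quick brown fox jumps over the lazy dog today",
          "the quick brown fox jumped over the lazy dog today"))  # 0.1
```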
Thanks for your reply.
That’s a good point.
Assuming the training data is similar to the data used for testing, then in theory, the greater the quantity and diversity of the training data, the better the performance. Right?
Yes, naturally. Once you have the data, all that is left is to fine-tune your hyperparameters and maybe the geometry of your model. For those real-world environments you’ll need even more training data, because there are many different kinds of noise that can affect performance.
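As a rough illustration of what “fine-tuning hyperparameters and geometry” means in practice, here is a sketch that launches the Mozilla DeepSpeech training script with a few of the usual knobs. The flag names follow DeepSpeech.py around v0.6/v0.7 and the paths and values are placeholders, so check your version’s --helpfull output before relying on them:

```python
import subprocess

# Placeholder values -- tune these for your own data and hardware.
hparams = {
    "--train_files": "data/train.csv",   # hypothetical CSV paths
    "--dev_files": "data/dev.csv",
    "--test_files": "data/test.csv",
    "--n_hidden": "2048",                # "geometry": width of the hidden layers
    "--learning_rate": "0.0001",
    "--dropout_rate": "0.15",
    "--epochs": "30",
    "--train_batch_size": "24",
    "--checkpoint_dir": "checkpoints/",
}

# Flag names are assumed from Mozilla DeepSpeech's DeepSpeech.py (circa v0.6/v0.7);
# verify them against your installed version.
cmd = ["python3", "DeepSpeech.py"]
for flag, value in hparams.items():
    cmd += [flag, value]

subprocess.run(cmd, check=True)
```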
Also, I don’t think Google’s models are that good. I tested a few services and found that Google did worse than IBM Watson and Amazon. Then I trained my own model with 350 hours of speech prepared for my test conditions and got better results than all three.
Interesting, thanks!