DeepSpeech full explanation

Hello @lissyx @kdavis @reuben

Can you give me any source, paper, or link that explains Mozilla DeepSpeech fully? I have already gone through your WER < 10 blog post, but I want more detail on how the acoustic model and the language model work.

I want to build speech recognition for the restaurant domain. My model mainly needs to understand every menu item, be it Indian, continental, or any other dish, and also phone numbers. Our customers will mostly have Indian accents.

Things I have done until now:

  1. Further trained DeepSpeech 0.4.1 on the Mozilla Common Voice English train/dev dataset. The final loss I got after training for 35 epochs was 0.8. When I did inference with a language model containing only the vocabulary of the train/dev dataset and spoke words from that vocabulary, it gave awesome results even in noise. So I tried making a custom language model in which I included different food items, plus the numbers “zero” to “nine”, one per line.
    When I did inference with that, the results were not good. For example, instead of “three cheers chocolate”, which I included in the language model, it transcribed my speech as “three cold”, where “cold” comes from the entry “cold coffee” that I also included in the language model. Even after increasing lm_alpha, lm_beta, and the beam width, there was no change.
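For context on why this can happen: the beam-search decoder described in the Deep Speech paper combines the acoustic score, the language model score, and a word-count bonus, weighted by lm_alpha and lm_beta. The following is a minimal sketch with made-up log-probabilities (the real decoder works over CTC beams, not whole sentences); it shows how a tiny custom LM with very peaky probabilities can let a shorter wrong phrase out-score what was actually spoken:

```python
# Sketch of the decoder's combined score, per the Deep Speech paper:
#   score(y) = log P_acoustic(y | audio) + lm_alpha * log P_lm(y) + lm_beta * word_count(y)
# All log-probabilities below are invented for illustration only.

def combined_score(log_p_acoustic, log_p_lm, word_count, lm_alpha, lm_beta):
    return log_p_acoustic + lm_alpha * log_p_lm + lm_beta * word_count

# What was spoken (3 words) vs. a shorter phrase a tiny custom LM strongly favors.
spoken = combined_score(-8.0, -6.0, 3, lm_alpha=0.75, lm_beta=1.85)  # "three cheers chocolate"
wrong  = combined_score(-9.5, -1.0, 2, lm_alpha=0.75, lm_beta=1.85)  # "three cold"
print(wrong > spoken)   # True: the LM term dominates and the wrong phrase wins

# With a smaller lm_alpha, the acoustic evidence wins instead.
spoken2 = combined_score(-8.0, -6.0, 3, lm_alpha=0.3, lm_beta=1.85)
wrong2  = combined_score(-9.5, -1.0, 2, lm_alpha=0.3, lm_beta=1.85)
print(spoken2 > wrong2)  # True
```

The point is that with a tiny vocabulary the LM probabilities are very concentrated on its few entries, so the balance set by lm_alpha and lm_beta matters a lot.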

So I am thinking of training on Indian-accented recordings of the above words, using around 300 hours of data, and then including the same words in the language model.
Will that improve the inference? Or is there another way? Or do I need to go deeper to understand it better?
If there is another way to improve the inference, please tell me.


The best place to start reading is likely the original paper from Baidu, Deep Speech: Scaling up end-to-end speech recognition. The “core” of our model is similar to what’s described there.

What might be more effective for lm_alpha and lm_beta, instead of just making them larger, is to do a “grid search” over their possible values to find which pair of values gives the best results in your use case.
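Such a grid search can be sketched like this; evaluate_wer is a hypothetical stand-in for decoding your held-out test set with a given (lm_alpha, lm_beta) pair and measuring the word error rate:

```python
import itertools

def evaluate_wer(lm_alpha, lm_beta):
    # Hypothetical: in practice, decode a held-out test set with these
    # hyperparameters and return the measured word error rate.
    # This toy curve has its minimum near (0.9, 2.1).
    return (lm_alpha - 0.9) ** 2 + 0.5 * (lm_beta - 2.1) ** 2

alphas = [0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95]
betas  = [1.55, 1.70, 1.85, 2.00, 2.15]

# Try every (alpha, beta) pair and keep the one with the lowest WER.
best = min(itertools.product(alphas, betas),
           key=lambda pair: evaluate_wer(*pair))
print("best (lm_alpha, lm_beta):", best)  # (0.90, 2.15) for this toy curve
```

With a real test set, each evaluate_wer call is a full decoding run, so keep the grid coarse at first and refine around the best pair.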

Generally, training, or fine-tuning, on more data that’s similar to the end use case will improve results.

However, the question is which is easier: tuning lm_alpha and lm_beta, or training on new data. Given the choice, I’d try tuning lm_alpha and lm_beta first, as that’s usually easier.

What possible values should we try in the grid search? From what I understand, alpha is the language model weighting parameter, so maybe a grid with a 0.05 delta, e.g., [0.65, 0.70, 0.75, 0.80, 0.85], something like that. And beta is the word insertion reward parameter? I don’t completely understand this. What kind of range should we try for beta? Also, what is the expected behavior for a given beta value, i.e., what should the model do when beta is higher than the default (1.85), at the default, and lower?

I’d maybe start with the default values, double them for an upper bound, halve them for a lower bound, then do the search.

If you find a minimum in that range, you’re done. If not, continue the search in whichever direction the previous experiment indicated, i.e. in whichever direction the WER decreased but didn’t reach a minimum.
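The halve/double bounds and the continue-on-edge rule above can be sketched for one parameter as follows; the WER curve here is invented purely to trigger the "extend the grid" case, and 0.75 is assumed as the lm_alpha default alongside the 1.85 beta default mentioned above:

```python
def make_grid(default, steps=5):
    """Evenly spaced grid from half the default to double the default."""
    lo, hi = default / 2, default * 2
    step = (hi - lo) / (steps - 1)
    return [round(lo + i * step, 4) for i in range(steps)]

def search_1d(grid, wer_of):
    """Return the grid value with the lowest WER, plus a hint on whether
    the search should continue past an edge of the grid."""
    wers = [wer_of(v) for v in grid]
    i = wers.index(min(wers))
    if i == 0:
        return grid[i], "extend the grid toward lower values"
    if i == len(grid) - 1:
        return grid[i], "extend the grid toward higher values"
    return grid[i], "interior minimum found, done"

# Hypothetical WER curve whose true minimum (1.8) lies above the upper
# bound 2 * 0.75 = 1.5, so the search must continue upward.
best_alpha, hint = search_1d(make_grid(0.75), lambda a: (a - 1.8) ** 2)
print(best_alpha, hint)  # 1.5 extend the grid toward higher values
```

The same loop applies to lm_beta; in practice you would search the two jointly as in the grid search above, but the edge-versus-interior check is the same.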
