I would like to know why DeepSpeech (after 0.4.1) sets the Hamming window length to 32 ms and the window step to 20 ms.
Personally, I prefer a 25 ms window length and a 10 ms window step, since I've seen several papers use those values.
How about exposing these two parameters as FLAGS?
I think the appropriate Hamming window length may depend on which language is being trained.
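For context, one plausible (though unconfirmed) motivation for 32 ms: at a 16 kHz sample rate, a 32 ms window is exactly 512 samples, a power of two, so the FFT needs no zero-padding, whereas a 25 ms window is 400 samples and would typically be padded up to 512 anyway. A quick sketch of the arithmetic (illustration only, not DeepSpeech code):

```python
def window_samples(window_ms, sample_rate=16000):
    """Number of samples covered by a window of `window_ms` milliseconds."""
    return int(sample_rate * window_ms / 1000)

for ms in (32, 25):
    n = window_samples(ms)
    # power-of-two check: n & (n - 1) == 0 only for powers of two
    print(f"{ms} ms -> {n} samples, power of two: {(n & (n - 1)) == 0}")
```

At 16 kHz this prints 512 samples (power of two) for 32 ms and 400 samples (not a power of two) for 25 ms.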
The step was already 20 ms before; we were just throwing away every other window. The change was intended to keep equivalent performance characteristics while hopefully discarding less data. Note that changing these settings requires training your own model from scratch.
Thanks for your answer.
I'm going to train my own model on Chinese; that's why I'm asking about the parameters of the MFCC function.
One more question about MFCC: why don't we need to normalize the audio features?
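For reference, when normalization is applied it is often per-utterance cepstral mean and variance normalization (CMVN): each MFCC coefficient is shifted and scaled so it has zero mean and unit variance over the utterance. A minimal sketch (my own illustration, not the DeepSpeech pipeline):

```python
from statistics import mean, stdev

def cmvn(frames):
    """Per-coefficient mean/variance normalization over one utterance.

    `frames` is a list of equal-length MFCC vectors (lists of floats).
    Returns the normalized frames."""
    cols = list(zip(*frames))            # one tuple per coefficient
    mus = [mean(c) for c in cols]
    sds = [stdev(c) or 1.0 for c in cols]  # guard against zero variance
    return [[(x - m) / s for x, m, s in zip(f, mus, sds)] for f in frames]

# Toy example: three frames of two coefficients each.
feats = [[1.0, 10.0], [3.0, 14.0], [5.0, 18.0]]
norm = cmvn(feats)
```

Whether this is needed depends on the front end: some feature pipelines already produce roughly standardized values, and batch normalization inside the network can absorb input scale to some extent.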