MFCC feature dimensions

The documentation for the "audiofile_to_input_vector" function says that "MFCC features at every 0.01s time step with a window length of 0.025s" are calculated. I tried to confirm this statement.
I have a 16kHz wav file containing 9631014 samples. The MFCC features I get from the "audiofile_to_input_vector" function have dimension 30097 x 494, which I read as ceil(9631014/320) x (26 + 2*26*9).
I conclude that 494 MFCC features are extracted for every 320 samples, which at 16kHz corresponds to a 0.02s time step. Is my reasoning correct? Is the time step really 0.02s instead of 0.01s?

Figured out the answer. This is due to the parameter "BiRNN stride = 2", which keeps only every other feature frame, so the actual time step is 0.02s.
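For anyone who wants to check the numbers, here is a minimal sketch of the arithmetic in plain Python. It does not call "audiofile_to_input_vector"; the helper name expected_shape is made up for illustration, and the frame-count formula (1 + ceil((samples - window) / step)) is my assumption about how the MFCC framing pads the last frame. The values 26 cepstral coefficients and 9 context frames are inferred from the 26 + 2*26*9 breakdown above.

```python
import math


def expected_shape(num_samples, sample_rate=16000, win_len=0.025,
                   win_step=0.01, stride=2, num_cep=26, num_context=9):
    """Hypothetical helper: estimate the feature matrix shape.

    Assumes frames are counted as 1 + ceil((samples - window) / step)
    and that the stride keeps every `stride`-th frame.
    """
    frame_len = int(round(win_len * sample_rate))    # 400 samples at 16kHz
    frame_step = int(round(win_step * sample_rate))  # 160 samples at 16kHz

    if num_samples <= frame_len:
        num_frames = 1
    else:
        num_frames = 1 + math.ceil((num_samples - frame_len) / frame_step)

    # Keeping every `stride`-th frame, starting from the first one.
    kept_frames = math.ceil(num_frames / stride)

    # Each row holds the current frame plus `num_context` frames on each side.
    width = num_cep + 2 * num_cep * num_context

    # Time between two consecutive rows that survive the stride.
    effective_step = win_step * stride
    return kept_frames, width, effective_step


frames, width, step = expected_shape(9631014)
print(frames, width, step)
```

Under these assumptions the result matches the 30097 x 494 shape reported above, with 0.02s between consecutive rows, even though the MFCCs themselves are computed every 0.01s.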

Yep! We should probably experiment with computing features over 20ms windows instead of using the stride to see how it performs…
