Word/letter timestamps with DeepSpeech

I’m also interested in extracting timing info and spent some time looking into it today. Timesteps are in the range 0 to n_frames, so you should be able to compute (total_duration_secs / n_frames) * timestep to get the position in seconds.

So in my case n_frames was 1000 and the duration of the file was 20 seconds. One of my timesteps was 18 so (20/1000)*18 = 0.36 secs, which was pretty close to the position of the word according to Adobe Audition.
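
Here is a minimal sketch of that calculation in Python, using the numbers from my file. The function name `timestep_to_seconds` is made up for illustration and is not part of DeepSpeech; it only assumes you already have the timestep, n_frames and the file duration in hand.

```python
def timestep_to_seconds(timestep: int, n_frames: int, total_duration_secs: float) -> float:
    """Map a timestep index in the range 0..n_frames to a position in seconds."""
    if n_frames <= 0:
        raise ValueError("n_frames must be positive")
    if not 0 <= timestep <= n_frames:
        raise ValueError("timestep must lie in the range 0..n_frames")
    # Each frame covers total_duration_secs / n_frames seconds of audio.
    return (total_duration_secs / n_frames) * timestep


# Values from the example above: 1000 frames, 20-second file, timestep 18.
position = timestep_to_seconds(18, 1000, 20.0)
print(f"{position:.2f} secs")
```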

Of course, by splitting the file into 1000 frames, you can only be accurate to the nearest 1/1000th of its duration (20 ms for my 20-second file). So, for the same number of frames, this will likely produce more accurate results on shorter files.
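
To put rough numbers on that, the sketch below prints how much audio a single frame covers for a few file lengths. It assumes n_frames stays at 1000 whatever the file length, which I have only seen for my one file, and `frame_resolution_secs` is a hypothetical helper, not a DeepSpeech function.

```python
def frame_resolution_secs(total_duration_secs: float, n_frames: int) -> float:
    """Seconds of audio covered by one frame, i.e. the best achievable precision."""
    if n_frames <= 0:
        raise ValueError("n_frames must be positive")
    return total_duration_secs / n_frames


# Assumption: n_frames is fixed at 1000 regardless of the file's duration.
N_FRAMES = 1000
for duration in (5.0, 20.0, 60.0):
    resolution_ms = frame_resolution_secs(duration, N_FRAMES) * 1000
    print(f"{duration:5.1f} s file: one frame covers about {resolution_ms:.0f} ms")
```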
