Word/letter timestamp with deep speech

dabinat · February 5, 2019, 1:03am

I’m also interested in extracting timing info and I spent some time looking into it today. Timesteps are in the range 0-n_frames. You therefore should be able to go (total_duration_secs/n_frames)*timestep to get the position in seconds.

So in my case n_frames was 1000 and the duration of the file was 20 seconds. One of my timesteps was 18 so (20/1000)*18 = 0.36 secs, which was pretty close to the position of the word according to Adobe Audition.

Of course, by splitting up the file into units of 1000, you can only be accurate to the nearest 1/1000th. So this will likely produce more accurate results on shorter files.

Topic		Replies	Views
Using deep speech to get timestamp for each word, not only string DeepSpeech	1	2088	February 17, 2019
Speech-to-text json result with time per word DeepSpeech	3	1146	October 19, 2018
Word timestamps DeepSpeech	0	636	August 27, 2018
Language Model influence on word timings DeepSpeech	6	524	July 30, 2019
Using DeepSpeech to determine location of speech in transcription DeepSpeech	1	579	March 1, 2019

Word/letter timestamp with deep speech

Related topics