Word/letter timestamps with DeepSpeech

I’m trying to get timing information on the transcribed speech, i.e. when words were spoken. I was looking over the repo on GitHub and I saw this:

This commit and the discussion around it seem to indicate that this feature has been implemented. It’s my first time looking at DeepSpeech, though, and I’m not sure how to invoke this feature if it actually exists.

Any help would be much appreciated.


Timings are produced by the new CTC algorithm, but we have not exposed them yet in the API.

Great. Thanks for getting back to me. Can you give me some pointers to the code I can take a look at? I’m interested in learning about this area and possibly contributing some patches to expose this info.

@amir.pavlo as this part of the code is @reuben’s expertise, I think he’d be best to advise.

Take a look at the decoder sources in native_client/ctcdecode, the get_beam_search_result function returns an Output structure that contains the predicted characters as well as timesteps for each character. Exposing this in the API requires experimentation to figure out if/how this data needs to be transformed before being shared with users, how accurate it is, etc.

I’m also interested in extracting timing info and I spent some time looking into it today. Timesteps are in the range 0–n_frames. You should therefore be able to compute (total_duration_secs/n_frames)*timestep to get the position in seconds.

So in my case n_frames was 1000 and the duration of the file was 20 seconds. One of my timesteps was 18 so (20/1000)*18 = 0.36 secs, which was pretty close to the position of the word according to Adobe Audition.

Of course, by splitting the file into 1000 units, you can only be accurate to the nearest 1/1000th of its duration. So this will likely produce more accurate results on shorter files.
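To make that concrete, here is a minimal Python sketch of the conversion; the even spacing of timesteps across the file is an assumption carried over from the description above:

```python
def timestep_to_seconds(timestep, n_frames, total_duration_secs):
    """Map a CTC timestep index to an approximate position in seconds.

    Assumes timesteps are spaced evenly over the whole file, as
    described above, so the result is only accurate to one frame.
    """
    return (total_duration_secs / n_frames) * timestep

# The worked example from the post: timestep 18 of 1000 frames
# over a 20-second file lands near 0.36 s.
position = timestep_to_seconds(18, 1000, 20)
```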


Hi, thanks for this investigation! Have you experimented more with the accuracy of timesteps? Does it work well on a whole audio file? I’m going to try testing it on my own, but for now your data would be valuable to me!

I only looked into it briefly and haven’t done a lot of work with it. I’m not a good enough C++ coder to submit a PR for this so I’m hoping my investigation at least helped to save some time for someone who could submit a PR, or even the dev team themselves.

The file gets split into 1000 “buckets” of time and this method is accurate to the start of each bucket. The DeepSpeech team recommends that you use short files of 5-8 seconds, so this is a level of accuracy that may be sufficient for such files, depending on your use-case (it is certainly sufficient for my own purposes). But if you need more accuracy there may be a way of determining where in the bucket the word starts, although I’m not experienced enough with Tensorflow to know where that information would be exposed.


You might also want to look at pocketsphinx: https://github.com/cmusphinx/pocketsphinx

Their API readily exports the timing information. I wrote a small program to get the timing info. I didn’t try with super long audio files. The accuracy seems ok for my application:
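For anyone curious, here is a rough Python sketch of that kind of program. The `Decoder` API and the 100-frames-per-second segment times are my reading of the PocketSphinx docs, and the no-argument `Decoder()` constructor (which loads the bundled US-English model) may differ between package versions, so treat this as a starting point rather than working code:

```python
FRAME_RATE = 100  # PocketSphinx reports segment times as frame indices at 100 fps

def frames_to_seconds(frame):
    """Convert a PocketSphinx frame index to seconds."""
    return frame / FRAME_RATE

def transcribe_with_times(raw_path):
    """Decode a 16 kHz, 16-bit mono PCM file; yield (word, start_s, end_s)."""
    from pocketsphinx import Decoder  # assumes the pocketsphinx package

    decoder = Decoder()  # default US-English model, if the package bundles one
    decoder.start_utt()
    with open(raw_path, "rb") as f:
        decoder.process_raw(f.read(), False, True)
    decoder.end_utt()
    for seg in decoder.seg():
        yield (seg.word,
               frames_to_seconds(seg.start_frame),
               frames_to_seconds(seg.end_frame))
```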



I have some code in a fork to do this now. Feedback appreciated :slight_smile:


Is it possible to output the start timestamp of a word along with its duration - something like the CTM format?

I looked into that briefly but the gaps seem to not actually line up with the space characters, so when I calculated the word duration it also included the pause between it and the next word.

It might be solvable but I didn’t have time to troubleshoot it in detail.
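For what it’s worth, here’s an untested sketch of how the grouping could work, using per-character timesteps and the timestep-to-seconds approximation from earlier in the thread. As noted, a word’s duration computed this way runs up to the following space’s timestep, so it absorbs the pause before the next word:

```python
def words_with_times(chars, timesteps, n_frames, total_duration_secs):
    """Group per-character CTC output into (word, start_s, duration_s).

    `chars` and `timesteps` are parallel sequences of decoded characters
    and their timestep indices. Caveat from the thread: the duration runs
    up to the next space's timestep, so it includes the trailing pause.
    """
    secs_per_frame = total_duration_secs / n_frames
    words, current, start = [], [], None
    for ch, ts in zip(chars, timesteps):
        if ch == " ":
            if current:
                words.append(("".join(current), start * secs_per_frame,
                              (ts - start) * secs_per_frame))
                current, start = [], None
        else:
            if start is None:
                start = ts
            current.append(ch)
    if current:  # final word, no trailing space
        words.append(("".join(current), start * secs_per_frame,
                      (timesteps[-1] - start) * secs_per_frame))
    return words
```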

Hey Reuben, I can’t find this function get_beam_search_result, is it still there? I’m very interested in character timestamps as well.

In the 0.5.0 API you can get timestamps using, in C, the call DS_SpeechToTextWithMetadata().

There are analogous calls in Java, nodejs… called “sttWithMetadata()”
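Here is a hedged Python sketch of calling it via the 0.5.0 bindings. The Model constructor arguments (n_cep=26, n_context=9, beam_width=500) and the Metadata fields (`items`, `character`, `start_time`) are from memory of that release and may need adjusting for other versions:

```python
import wave

def items_to_pairs(items):
    """Flatten metadata items into (character, start_time_secs) tuples."""
    return [(item.character, item.start_time) for item in items]

def char_times(model_path, alphabet_path, wav_path):
    """Run a 0.5.0-style DeepSpeech model; return per-character start times.

    Assumes the deepspeech 0.5 Python package; the constructor values
    below are the ones the 0.5 docs used and may need adjusting.
    """
    import numpy as np
    from deepspeech import Model

    model = Model(model_path, 26, 9, alphabet_path, 500)
    with wave.open(wav_path, "rb") as w:
        rate = w.getframerate()
        audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)
    metadata = model.sttWithMetadata(audio, rate)
    return items_to_pairs(metadata.items)
```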
