Word/letter timestamps with DeepSpeech


(Amir Pavlo) #1

I’m trying to get timing information for the transcribed speech, i.e. when each word was spoken. I was looking over the repo on GitHub and I saw this:

This commit and the discussion around it seem to indicate that this feature has been implemented. It’s my first time looking at DeepSpeech, though, and I’m not sure how to invoke this feature, if it actually exists.

Any help would be much appreciated.


(kdavis) #2

Timings are produced by the new CTC algorithm, but we have not exposed them yet in the API.


(Amir Pavlo) #3

Great. Thanks for getting back to me. Can you give me some pointers to the code I should look at? I’m interested in learning about this area and possibly contributing some patches to expose this info.


(kdavis) #4

@amir.pavlo, as this part of the code is @reuben’s area of expertise, I think he’d be best placed to advise.


(Reuben Morais) #5

Take a look at the decoder sources in native_client/ctcdecode. The get_beam_search_result function returns an Output structure that contains the predicted characters as well as a timestep for each character. Exposing this in the API requires experimentation to figure out if and how this data needs to be transformed before being shared with users, how accurate it is, etc.
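
As a rough sketch of the kind of transformation that might be needed, the snippet below groups per-character timesteps into word-level start and end times. It does not call the decoder: the `chars` and `timesteps` lists only stand in for the two fields of the Output structure, and the function name `chars_to_word_timings` and the 20 ms duration per timestep are assumptions made for illustration, not part of the DeepSpeech API. The real step duration depends on how the model's audio features are configured.

```python
from typing import List, Tuple


def chars_to_word_timings(
    chars: List[str],
    timesteps: List[int],
    step_ms: int = 20,  # assumed duration of one decoder timestep, in milliseconds
) -> List[Tuple[str, int, int]]:
    """Group per-character timesteps into (word, start_ms, end_ms) tuples.

    Hypothetical helper: `chars` and `timesteps` must be parallel lists,
    one timestep per predicted character. Words are split on spaces.
    """
    if len(chars) != len(timesteps):
        raise ValueError("chars and timesteps must have the same length")

    words: List[Tuple[str, int, int]] = []
    current: List[str] = []
    start_step = 0
    last_step = 0

    for ch, step in zip(chars, timesteps):
        if ch == " ":
            # A space closes the word collected so far, if any.
            if current:
                words.append(("".join(current), start_step * step_ms, last_step * step_ms))
                current = []
            continue
        if not current:
            start_step = step  # first character of a new word
        current.append(ch)
        last_step = step

    # Flush the final word when the transcript does not end with a space.
    if current:
        words.append(("".join(current), start_step * step_ms, last_step * step_ms))

    return words


if __name__ == "__main__":
    example_chars = list("hi you")
    example_steps = [5, 7, 10, 12, 15, 18]
    for word, start_ms, end_ms in chars_to_word_timings(example_chars, example_steps):
        print(f"{word}: {start_ms} ms to {end_ms} ms")
```

Note that the end time here is only the timestep of the word's last character, not the moment the sound stops, which is one of the accuracy questions that would need checking against real audio.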