Word/letter timestamp with deep speech

amir.pavlo · January 11, 2019, 7:50am

I’m trying to get timing information on the transcribed speech, i.e., when words were spoken. I was looking over the repo on GitHub and I saw this:

This commit and the discussion around it seem to indicate that this feature has been implemented. It’s my first time looking at deepspeech though, and I’m not sure how to invoke this feature if it actually exists.

Any help would be much appreciated.

kdavis · January 11, 2019, 8:45am

Timings are produced by the new CTC algorithm, but we have not exposed them yet in the API.

amir.pavlo · January 11, 2019, 4:59pm

Great. Thanks for getting back to me. Can you give me some pointers to the code I can take a look at? I’m interested in learning about this area and possibly contributing some patches to expose this info.

kdavis · January 13, 2019, 7:27pm

@amir.pavlo as this part of the code is @reuben’s expertise, I think he’d be best to advise.

reuben · January 15, 2019, 11:18am

Take a look at the decoder sources in native_client/ctcdecode. The get_beam_search_result function returns an Output structure that contains the predicted characters as well as the timestep for each character. Exposing this in the API requires experimentation to figure out if/how this data needs to be transformed before being shared with users, how accurate it is, etc.

dabinat · February 5, 2019, 1:03am

I’m also interested in extracting timing info and I spent some time looking into it today. Timesteps are in the range 0-n_frames. You should therefore be able to compute (total_duration_secs/n_frames)*timestep to get the position in seconds.

So in my case n_frames was 1000 and the duration of the file was 20 seconds. One of my timesteps was 18 so (20/1000)*18 = 0.36 secs, which was pretty close to the position of the word according to Adobe Audition.

Of course, by splitting the file into 1000 units, you can only be accurate to the nearest 1/1000th of the file’s duration (0.02 secs in my 20-second example). So this will likely produce more accurate results on shorter files.
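The conversion described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in this post: the function names are made up for the example and are not part of the DeepSpeech API, and it assumes timesteps are evenly spaced across the file.

```python
def timestep_to_seconds(timestep, total_duration_secs, n_frames):
    """Map a decoder timestep to an approximate position in seconds.

    Hypothetical helper: assumes timesteps are evenly spaced over the file.
    """
    if n_frames <= 0:
        raise ValueError("n_frames must be positive")
    # Multiply before dividing to keep floating-point error small.
    return timestep * total_duration_secs / n_frames


def resolution_secs(total_duration_secs, n_frames):
    """Width of one timestep; positions cannot be finer than this."""
    if n_frames <= 0:
        raise ValueError("n_frames must be positive")
    return total_duration_secs / n_frames


if __name__ == "__main__":
    # The example from this post: 20-second file, 1000 frames, timestep 18.
    print(round(timestep_to_seconds(18, 20, 1000), 2))
    print(resolution_secs(20, 1000))
```

With a fixed number of frames, the resolution scales with the file length, which is why shorter files give finer-grained positions.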

nene · February 7, 2019, 6:08pm

Hi, thanks for this investigation! Have you experimented more with the accuracy of the timesteps? Does it work well on a whole audio file? I’m going to try testing on my own, but for now your data would be valuable for me!

dabinat · February 7, 2019, 6:22pm

I only looked into it briefly and haven’t done a lot of work with it. I’m not a good enough C++ coder to submit a PR for this so I’m hoping my investigation at least helped to save some time for someone who could submit a PR, or even the dev team themselves.

The file gets split into 1000 “buckets” of time and this method is accurate to the start of each bucket. The DeepSpeech team recommends that you use short files of 5-8 seconds, so this is a level of accuracy that may be sufficient for such files, depending on your use-case (it is certainly sufficient for my own purposes). But if you need more accuracy there may be a way of determining where in the bucket the word starts, although I’m not experienced enough with Tensorflow to know where that information would be exposed.

amir.pavlo · February 7, 2019, 6:40pm

You might also want to look at pocketsphinx: https://github.com/cmusphinx/pocketsphinx

Their API readily exports the timing information. I wrote a small program to get the timing info. I didn’t try with super long audio files. The accuracy seems ok for my application:

https://github.com/amirpavlo/YASP

dabinat · February 17, 2019, 11:28pm

I have some code in a fork to do this now. Feedback appreciated

github.com/mozilla/DeepSpeech

Word timestamps branch - feedback needed

opened 11:20PM - 17 Feb 19 UTC

closed 04:44AM - 20 Feb 19 UTC

dabinat

I've created a version of DeepSpeech with timing information exposed for each word: https://github.com/dabinat/DeepSpeech/tree/timing-info

It was my goal to break as little backwards-compatibility as possible, so by default the DeepSpeech app functions exactly as before and you need to use a -e flag to get the extra data.

Using the -e flag produces the following output:

```
./deepspeech -t -e --model ../models/output_graph.pbmm --alphabet ../models/alphabet.txt --lm ../models/lm.binary --trie ../models/trie --audio ../test_files/Theresa_May_interview_on_Andrew_Marr_Show_BBC_News-short.wav
file duration: 20
word: and, timestep: 1, time: 0.02
word: now, timestep: 18, time: 0.36
word: i, timestep: 29, time: 0.58
word: am, timestep: 38, time: 0.76
word: joined, timestep: 54, time: 1.08
word: life, timestep: 71, time: 1.42
word: in, timestep: 88, time: 1.76
word: the, timestep: 94, time: 1.88
word: studio, timestep: 100, time: 2
word: by, timestep: 122, time: 2.44
word: the, timestep: 129, time: 2.58
word: prime, timestep: 136, time: 2.72
word: minister, timestep: 148, time: 2.96
word: teresa, timestep: 169, time: 3.38
word: make, timestep: 194, time: 3.88
word: good, timestep: 204, time: 4.08
word: morning, timestep: 211, time: 4.22
word: from, timestep: 224, time: 4.48
word: lin, timestep: 243, time: 4.86
word: enter, timestep: 253, time: 5.06
word: and, timestep: 292, time: 5.84
word: can, timestep: 307, time: 6.14
word: we, timestep: 316, time: 6.32
word: agree, timestep: 322, time: 6.44
word: to, timestep: 338, time: 6.76
word: start, timestep: 345, time: 6.9
word: with, timestep: 358, time: 7.16
word: it, timestep: 368, time: 7.36
word: the, timestep: 379, time: 7.58
word: one, timestep: 386, time: 7.72
word: thing, timestep: 398, time: 7.96
word: the, timestep: 405, time: 8.1
word: voters, timestep: 415, time: 8.3
word: deserving, timestep: 434, time: 8.68
word: what, timestep: 464, time: 9.28
word: you, timestep: 472, time: 9.44
word: yourself, timestep: 480, time: 9.6
word: he, timestep: 503, time: 10.06
word: said, timestep: 510, time: 10.2
word: is, timestep: 516, time: 10.32
word: going, timestep: 522, time: 10.44
word: to, timestep: 529, time: 10.58
word: be, timestep: 532, time: 10.64
word: a, timestep: 535, time: 10.7
word: very, timestep: 544, time: 10.88
word: very, timestep: 579, time: 11.58
word: important, timestep: 604, time: 12.08
word: election, timestep: 634, time: 12.68
word: is, timestep: 668, time: 13.36
word: no, timestep: 679, time: 13.58
word: son, timestep: 693, time: 13.86
word: to, timestep: 707, time: 14.14
word: bite, timestep: 712, time: 14.24
word: militis, timestep: 766, time: 15.32
word: absolutely, timestep: 795, time: 15.9
word: crucial, timestep: 824, time: 16.48
word: because, timestep: 867, time: 17.34
word: this, timestep: 881, time: 17.62
word: is, timestep: 891, time: 17.82
word: as, timestep: 914, time: 18.28
word: i, timestep: 955, time: 19.1
word: think, timestep: 962, time: 19.24
word: the, timestep: 970, time: 19.4
word: most, timestep: 978, time: 19.56
word: simple, timestep: 990, time: 19.8
cpu_time_overall=17.11157
```

Timings are in seconds. I cross-referenced these timings with the audio file in Adobe Audition and they are accurate. The only ones that weren't quite right were for words DeepSpeech mistranscribed, which makes sense.

Again, I was trying to break as little as possible so the extended output is controlled by a variable in StreamingState so as to avoid passing it between function declarations. I did have to modify one function though - DS_SpeechToText now has an additional extendedOutput parameter. I would be keen to hear feedback from the devs on whether this is the best way of achieving things or whether passing the variable to functions directly is preferred.

ModelState::decode functions identically to before, but most of the logic is now in ModelState::decode_raw. This returns the vector output instead of the transcription so is a useful function to call if you need to do additional processing.

I have the following questions / requests for feedback before submitting a PR:

1. Is the current output acceptable or would something like JSON be preferred?
2. My background is primarily with Objective-C, not C++ and std, so optimization suggestions are appreciated.
3. I only speak English so don't have the ability to test with other languages. It'd be helpful if someone can test with non-English languages and let me know how well it works.

backprop7 · February 25, 2019, 7:22pm

Is it possible to output the start timestamp of a word along with its duration - something like CTM format?

dabinat · February 25, 2019, 10:21pm

I looked into that briefly but the gaps seem to not actually line up with the space characters, so when I calculated the word duration it also included the pause between it and the next word.

It might be solvable but I didn’t have time to troubleshoot it in detail.
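One possible way around the pause problem, sketched here in Python, is to end each word at the time of its last character instead of at the following space. This is an untested assumption about how the decoder's character timings behave, not something confirmed in this thread. The input layout (a list of `(character, time_secs)` tuples) and the function names are hypothetical, and the output follows the common CTM column order of file, channel, start, duration, word.

```python
def _finish(current):
    """Collapse the (character, time) pairs of one word into (word, start, duration)."""
    text = "".join(ch for ch, _ in current)
    start = current[0][1]
    end = current[-1][1]  # time of the last character, not of the next space
    return text, start, end - start


def words_with_durations(chars):
    """chars: ordered list of (character, time_secs) pairs (hypothetical layout)."""
    words = []
    current = []  # pairs belonging to the word being built
    for ch, t in chars:
        if ch == " ":
            if current:
                words.append(_finish(current))
                current = []
        else:
            current.append((ch, t))
    if current:
        words.append(_finish(current))
    return words


def to_ctm(words, file_id="audio", channel="1"):
    """Format (word, start, duration) tuples as CTM-style lines."""
    return [
        "%s %s %.2f %.2f %s" % (file_id, channel, start, duration, word)
        for word, start, duration in words
    ]


if __name__ == "__main__":
    # Made-up timings, for illustration only.
    sample = [("h", 0.10), ("i", 0.20), (" ", 0.50),
              ("y", 0.90), ("o", 1.00), ("u", 1.10)]
    for line in to_ctm(words_with_durations(sample)):
        print(line)
```

Note that a single-character word gets a duration of zero with this approach, and every duration is likely an underestimate because it stops where the last character begins.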

ena.1994 · May 16, 2019, 9:02am

Hey Reuben, I can’t find the get_beam_search_result function. Is it still there? I’m very interested in character timestamps as well.

kdavis · May 16, 2019, 3:31pm

In the 0.5.0 API you can get time stamps using, in C, the call DS_SpeechToTextWithMetadata().

There are analogous calls in Java, nodejs… called “sttWithMetadata()”
