I've created a version of DeepSpeech with timing information exposed for each word:
https://github.com/dabinat/DeepSpeech/tree/timing-info
My goal was to break backwards compatibility as little as possible, so by default the DeepSpeech client behaves exactly as before; you need to pass the `-e` flag to get the extra data.
Using the `-e` flag produces the following output:
```
./deepspeech -t -e --model ../models/output_graph.pbmm --alphabet ../models/alphabet.txt --lm ../models/lm.binary --trie ../models/trie --audio ../test_files/Theresa_May_interview_on_Andrew_Marr_Show_BBC_News-short.wav
file duration: 20
word: and, timestep: 1, time: 0.02
word: now, timestep: 18, time: 0.36
word: i, timestep: 29, time: 0.58
word: am, timestep: 38, time: 0.76
word: joined, timestep: 54, time: 1.08
word: life, timestep: 71, time: 1.42
word: in, timestep: 88, time: 1.76
word: the, timestep: 94, time: 1.88
word: studio, timestep: 100, time: 2
word: by, timestep: 122, time: 2.44
word: the, timestep: 129, time: 2.58
word: prime, timestep: 136, time: 2.72
word: minister, timestep: 148, time: 2.96
word: teresa, timestep: 169, time: 3.38
word: make, timestep: 194, time: 3.88
word: good, timestep: 204, time: 4.08
word: morning, timestep: 211, time: 4.22
word: from, timestep: 224, time: 4.48
word: lin, timestep: 243, time: 4.86
word: enter, timestep: 253, time: 5.06
word: and, timestep: 292, time: 5.84
word: can, timestep: 307, time: 6.14
word: we, timestep: 316, time: 6.32
word: agree, timestep: 322, time: 6.44
word: to, timestep: 338, time: 6.76
word: start, timestep: 345, time: 6.9
word: with, timestep: 358, time: 7.16
word: it, timestep: 368, time: 7.36
word: the, timestep: 379, time: 7.58
word: one, timestep: 386, time: 7.72
word: thing, timestep: 398, time: 7.96
word: the, timestep: 405, time: 8.1
word: voters, timestep: 415, time: 8.3
word: deserving, timestep: 434, time: 8.68
word: what, timestep: 464, time: 9.28
word: you, timestep: 472, time: 9.44
word: yourself, timestep: 480, time: 9.6
word: he, timestep: 503, time: 10.06
word: said, timestep: 510, time: 10.2
word: is, timestep: 516, time: 10.32
word: going, timestep: 522, time: 10.44
word: to, timestep: 529, time: 10.58
word: be, timestep: 532, time: 10.64
word: a, timestep: 535, time: 10.7
word: very, timestep: 544, time: 10.88
word: very, timestep: 579, time: 11.58
word: important, timestep: 604, time: 12.08
word: election, timestep: 634, time: 12.68
word: is, timestep: 668, time: 13.36
word: no, timestep: 679, time: 13.58
word: son, timestep: 693, time: 13.86
word: to, timestep: 707, time: 14.14
word: bite, timestep: 712, time: 14.24
word: militis, timestep: 766, time: 15.32
word: absolutely, timestep: 795, time: 15.9
word: crucial, timestep: 824, time: 16.48
word: because, timestep: 867, time: 17.34
word: this, timestep: 881, time: 17.62
word: is, timestep: 891, time: 17.82
word: as, timestep: 914, time: 18.28
word: i, timestep: 955, time: 19.1
word: think, timestep: 962, time: 19.24
word: the, timestep: 970, time: 19.4
word: most, timestep: 978, time: 19.56
word: simple, timestep: 990, time: 19.8
cpu_time_overall=17.11157
```
Timings are in seconds. I cross-referenced them with the audio file in Adobe Audition and they are accurate; the only ones that weren't quite right were for words DeepSpeech mistranscribed, which makes sense.
Again, to keep changes minimal, the extended output is controlled by a variable in `StreamingState`, which avoids threading it through every function signature. I did have to modify one function, though: `DS_SpeechToText` now takes an additional `extendedOutput` parameter. I'd be keen to hear feedback from the devs on whether this is the best way of achieving things or whether passing the variable to functions directly is preferred.
`ModelState::decode` functions identically to before, but most of the logic now lives in `ModelState::decode_raw`. This returns the vector output instead of the transcription, so it's a useful function to call if you need to do additional processing.
I have the following questions / requests for feedback before submitting a PR:
1. Is the current output acceptable or would something like JSON be preferred?
2. My background is primarily in Objective-C, not C++ and the STL, so optimization suggestions are appreciated.
3. I only speak English, so I can't test with other languages. It would be helpful if someone could test with non-English languages and let me know how well it works.
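Regarding question 1, a JSON rendering of the same data might look something like this (just a sketch of a possible shape, not something currently implemented):

```json
{
  "file_duration": 20,
  "words": [
    { "word": "and", "timestep": 1, "time": 0.02 },
    { "word": "now", "timestep": 18, "time": 0.36 }
  ]
}
```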