@dabinat could you briefly explain where and how we get the word start and end times? Does using a language model influence the process? Or is there already documentation about it? I would like to have “phoneme” or letter timings instead, and I’m wondering how difficult this would be to implement. Thanks in advance.
Extended Metadata already exposes per-character timing. Have you tried that?
The API currently only exposes letter timings. As far as I know, only the native client converts them to word timings, and the other clients (Python, .NET, etc.) expose only the letter timings. So either use those clients or edit the native client to remove the word timings.
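To illustrate the letter-to-word conversion mentioned above, here is a minimal Python sketch. It assumes the letter timings have already been extracted into plain `(character, start_time)` pairs; that input shape and the name `letters_to_words` are hypothetical, not part of any client. A word ends at the start time of the space that follows it, or at its last letter if no space follows.

```python
from typing import Dict, List, Tuple, Union

Letter = Tuple[str, float]            # (character, start time in seconds), assumed shape
Word = Dict[str, Union[str, float]]


def letters_to_words(letters: List[Letter]) -> List[Word]:
    """Group per-letter timings into per-word timings (illustrative only)."""
    words: List[Word] = []
    current = ""       # letters of the word being built
    start = 0.0        # start time of the first letter of that word
    last = 0.0         # start time of the most recent letter seen
    for ch, t in letters:
        if ch == " ":
            if current:
                # The word ends where the following space begins.
                words.append({"word": current, "start": start, "end": t})
                current = ""
        else:
            if not current:
                start = t
            current += ch
            last = t
    if current:
        # No trailing space: fall back to the last letter's start time.
        words.append({"word": current, "start": start, "end": last})
    return words


example = [("h", 0.10), ("i", 0.20), (" ", 0.30), ("y", 0.50), ("o", 0.60)]
for w in letters_to_words(example):
    print(w)
```

If you only need letter timings, you would simply skip this grouping step and use the pairs directly.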
@lissyx Do you think it would be useful to have a native client flag to toggle between letter and word timings?
The deepspeech C++ binary is not intended for anything more than demo purposes.
That’s my point: there’s already enough information exposed, and there are examples of how to use it the way @ena.1994 needs.
Quick question on this: is this per-character timing calculated from the frame classification, or is there a more sophisticated implementation? I fear that alignments from LSTM+CTC training can be quite off.
See e.g. Sak, Haşim, et al. “Learning acoustic frame labeling for speech recognition with recurrent neural networks.” 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2015.
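For context, this is roughly what "timing from the frame classification" could look like in its simplest form: take the best label per frame, collapse repeats, drop blanks, and use the first frame of each emitted character as its start time. This is a sketch under stated assumptions, not the project's decoder; the blank symbol `_`, the 20 ms frame step, and the name `greedy_ctc_timings` are all hypothetical.

```python
from typing import List, Tuple


def greedy_ctc_timings(
    frame_labels: List[str],
    blank: str = "_",       # assumed blank symbol
    frame_s: float = 0.02,  # assumed frame step in seconds
) -> List[Tuple[str, float]]:
    """Return (character, start_time) pairs from per-frame best labels."""
    timings: List[Tuple[str, float]] = []
    prev = blank
    for i, label in enumerate(frame_labels):
        # Emit a character only on the first frame of a new non-blank run.
        if label != blank and label != prev:
            timings.append((label, i * frame_s))
        prev = label
    return timings


# "h" first fires at frame 2, "i" at frame 5.
print(greedy_ctc_timings(list("__hh_ii__")))
```

Because CTC training only rewards the correct label sequence, not where the spikes fall, the frame at which a character first fires can lag the acoustic onset, which is the concern raised above.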
It’s a pretty simple implementation, so the start point is often late. A 3 ms offset worked decently on my test files.
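Applying such an offset is a one-liner once the timings are in hand. The sketch below assumes `(character, start_time)` pairs in seconds and assumes the offset is subtracted (since the reported start is late); the name `shift_earlier` is hypothetical, and the right offset value will depend on your own files.

```python
from typing import List, Tuple


def shift_earlier(
    timings: List[Tuple[str, float]],
    offset_s: float = 0.003,  # 3 ms, the value reported above; tune per dataset
) -> List[Tuple[str, float]]:
    """Move every start time earlier by offset_s, never going below zero."""
    return [(ch, max(0.0, t - offset_s)) for ch, t in timings]


print(shift_earlier([("h", 0.001), ("i", 0.500)]))
```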