Even though Deep Speech is designed for and trained on sentence-length snippets, a lot of applications would benefit from timing metadata for each word, maybe even for each character or phoneme.
I have had a look at the code for the native client and I can’t see any obvious points where this could be bolted on or integrated. Any suggestions?