Time Metadata

(Jan) #1

Even though Deep Speech is designed for and trained on sentence-length snippets, many applications would benefit from time metadata for each word, maybe even per character or phoneme.

I have had a look at the code for the native client, and I can’t see any obvious point where this could be bolted on or integrated. Any suggestions?

(kdavis|PTO) #2

The CTC algorithm we use neither needs nor lends itself to producing “time data”, such as where a particular character or phoneme starts or ends.

However, there is some research (I don’t remember the reference off the top of my head) which finds that modifications of CTC can mark, approximately, where a particular character starts.
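As a rough illustration of what “approximately marking where a character starts” could look like, here is a minimal sketch of greedy CTC decoding that also records the frame at which each character is first emitted. This is not taken from the Deep Speech code base, nor from the research mentioned above. The function name, the `_` blank symbol, the toy alphabet, and the 20 ms frame step are all assumptions made for the example, and the position of a CTC output peak is only a loose hint of where the character is actually spoken.

```python
from typing import List, Sequence, Tuple

BLANK = "_"  # hypothetical blank symbol, chosen for this sketch


def greedy_ctc_with_times(
    frame_probs: Sequence[Sequence[float]],
    alphabet: Sequence[str],
    frame_seconds: float = 0.02,  # assumed step between frames (20 ms)
) -> List[Tuple[str, int, float]]:
    """Greedy CTC decode returning (character, frame index, approx. seconds).

    frame_probs[t][k] is the network's probability for alphabet[k] at frame t.
    """
    result = []
    prev = None
    for t, probs in enumerate(frame_probs):
        # Pick the most likely symbol for this frame.
        best = max(range(len(probs)), key=probs.__getitem__)
        symbol = alphabet[best]
        # CTC collapse rule: skip blanks, and skip a symbol that merely
        # repeats the previous frame's symbol.
        if symbol != BLANK and symbol != prev:
            result.append((symbol, t, t * frame_seconds))
        prev = symbol
    return result


if __name__ == "__main__":
    alphabet = [BLANK, "h", "i"]
    # Toy per-frame probabilities whose argmax sequence is: _ h h _ i i
    probs = [
        [0.8, 0.1, 0.1],
        [0.1, 0.8, 0.1],
        [0.2, 0.7, 0.1],
        [0.9, 0.05, 0.05],
        [0.1, 0.1, 0.8],
        [0.1, 0.2, 0.7],
    ]
    for char, frame, seconds in greedy_ctc_with_times(probs, alphabet):
        print(char, frame, seconds)
```

In the toy input, `h` is first emitted at frame 1 and `i` at frame 4. Word-level times could then be derived by grouping characters between spaces, under the same caveats.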

That said, as we don’t need such “time data” for our work, I doubt we’ll get around to modifying our CTC to output it.

(Jan) #3

It would open up whole new fields of OSS applications for Deep Speech. :frowning: