What prediction information is available from deepspeech inference?

(Rebecca) #1

Hi, I’m using the deepspeech Python package to run speech-to-text inference on a wav file. I see output like this:

Loading model from file models/output_graph.pb
Loaded model in 0.249s.
Loading language model from files models/lm.binary models/trie
Loaded language model in 1.428s.
Running inference.
Inference took 9.582s for 5.000s audio file.

Does this mean that the model predicted that my wav file contained the word “yes”? Is there an estimated confidence/accuracy score on this prediction? Were any other files with prediction information created?

Is it possible to get timestamps on the predicted text? For example: the model predicted that the wav file contained someone saying the word “yes”, starting at 1.000 sec and ending at 1.010 sec.
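
For reference, the only numeric information in the log quoted above is timing. A minimal sketch of turning that last line into a real-time factor (inference time divided by audio duration), assuming the line format is exactly as quoted in this post; the helper name `real_time_factor` is hypothetical and not part of the deepspeech package:

```python
import re

# Matches the timing line as quoted in the post above (format is an assumption
# based on that quoted output, not on any documented log specification).
TIMING_RE = re.compile(
    r"Inference took (?P<infer>\d+(?:\.\d+)?)s for (?P<audio>\d+(?:\.\d+)?)s audio file"
)

def real_time_factor(log_line):
    """Return inference_time / audio_duration, or None if the line does not match."""
    match = TIMING_RE.search(log_line)
    if match is None:
        return None
    infer_seconds = float(match.group("infer"))
    audio_seconds = float(match.group("audio"))
    if audio_seconds == 0:
        return None  # avoid dividing by zero on an empty audio file
    return infer_seconds / audio_seconds

line = "Inference took 9.582s for 5.000s audio file."
print(round(real_time_factor(line), 3))  # 9.582 / 5.000 -> 1.916
```

A value above 1 means inference ran slower than real time. Nothing in these log lines carries the transcript, a confidence score, or timestamps.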

(Rebecca) #2

Re: timestamps – It appears the conversation is happening here: https://github.com/mozilla/DeepSpeech/issues/1125

Re: confidence scores – https://github.com/mozilla/DeepSpeech/issues/900

(Kdavis) #3

Currently there is no way to produce timestamps. Adding them would require a bit of work and, if it happens, will likely be part of the 0.3.0 release or a later one.

(Yv) #4

“If it happens”: do you mean if release 0.3.0 happens, or if the feature is included in the release?

(Kdavis) #5

Sorry. 0.3.0 will happen. What I meant is: if the feature is included in the 0.3.0 release.

(Rebecca) #6

Cool, thanks for pointing me to the release projects, @kdavis. I’m interested in this feature and will have some extra time over the next month. My background is more academic (understanding research) than building production code, but I have contributed to open source projects before. Is this GitHub issue a good place to share learnings and collaborate?

(Lissyx) #7

I guess that collaborating on fixing the issue should happen on that issue, yes. And opinionated discussions on the issue itself make sense there :slight_smile: