[QUESTION] wrong start_time when decoding audio file

Hello, I wanted to test DeepSpeech (I am using version 0.9.7) and see how accurate the timestamp recognition is. I created the following audio file: https://drive.google.com/file/d/16e_GzM-AgqYGs37y5fEwifX_p5AJ2moe/view?usp=sharing and realized that the results are not as expected:

deepspeech --model deepspeech-0.9.3-models.pbmm --scorer deepspeech-0.9.3-models.scorer --audio speech.wav --json

"words": [
            {
              "word": "as",
              "start_time": 0.38,
              "duration": 0.12
            },
            {
              "word": "i",
              "start_time": 0.58,
              "duration": 0.1
            },
            {
              "word": "concluded",
              "start_time": 0.76,
              "duration": 0.48
            },
            {
              "word": "my",
              "start_time": 1.32,
              "duration": 0.16
            },
            {
              "word": "term",
              "start_time": 1.6,
              "duration": 0.34
            },
            {
              "word": "as",
              "start_time": 2.06,
              "duration": 0.2
            },
            {
              "word": "the",
              "start_time": 2.32,
              "duration": 0.14
            },
            {
              "word": "forty",
              "start_time": 2.54,
              "duration": 0.3
            },
            {
              "word": "fifth",
              "start_time": 2.92,
              "duration": 0.38
            },
            {
              "word": "president",
              "start_time": 3.4,
              "duration": 0.48
            },
            {
              "word": "of",
              "start_time": 3.96,
              "duration": 0.06
            },
            {
              "word": "the",
              "start_time": 4.06,
              "duration": 0.1
            },
            {
              "word": "united",
              "start_time": 4.22,
              "duration": 0.38
            },
            {
              "word": "states",
              "start_time": 4.68,
              "duration": 0.76
            },
            {
              "word": "it",
              "start_time": 5.58,
              "duration": 0.1
            }

When I try to cut out the segment where it says it found a word, the segment doesn’t match the real word. For example, if you cut at 0.38 seconds with a duration of 0.12 seconds, you will find that the word “as” is not there; when you inspect the file with an audio editor (like Audacity), you can see that “as” occurs before the reported time.
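For reference, this is roughly how I cut the segments to check them (a minimal sketch using only the stdlib `wave` module; the file names are placeholders, and the 0.38 s / 0.12 s values are the "as" entry from the JSON above):

```python
import wave

def cut_segment(in_path, out_path, start_time, duration):
    """Copy the frames between start_time and start_time + duration
    (both in seconds) from in_path into a new WAV file at out_path."""
    with wave.open(in_path, "rb") as src:
        sr = src.getframerate()
        # Seek to the first frame of the segment, then read its frames.
        src.setpos(int(start_time * sr))
        frames = src.readframes(int(duration * sr))
        with wave.open(out_path, "wb") as dst:
            # Keep channels/width/rate identical to the source file.
            dst.setparams(src.getparams())
            dst.writeframes(frames)

# e.g. cut_segment("speech.wav", "as.wav", 0.38, 0.12)
```

Listening to the resulting clip (or opening it in Audacity) is how I confirmed the reported window does not contain the word.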

Is this normal? Can it be improved in some way?

Thank you!

I also get inaccurate token start_time values from DeepSpeech. When compared against the actual waveform in tools such as Audacity, the DeepSpeech timings are consistently late.
Here’s the code I use to generate the metadata/transcription (deepspeech version 0.9.3):

import deepspeech
from scipy.io import wavfile

model_path = "deepspeech-0.9.3-models.pbmm"
scorer_path = "deepspeech-0.9.3-models.scorer"
ds_model = deepspeech.Model(model_path)
ds_model.enableExternalScorer(scorer_path)

# The release models expect 16-bit mono PCM at 16 kHz;
# wavfile.read returns (sample_rate, int16 array).
sr, audio_signal = wavfile.read("some_audio.wav")
assert sr == ds_model.sampleRate(), "audio must match the model sample rate"
metadata = ds_model.sttWithMetadata(audio_signal)
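As far as I understand, the per-word entries in the --json output are just a grouping of DeepSpeech's per-character token metadata (`metadata.transcripts[0].tokens`, where each token has a `.text` and a `.start_time`), so you can dump the raw character timings yourself to see where the drift comes from. A minimal sketch of that grouping (my own helper, not part of the DeepSpeech API; it treats a space token as a word boundary):

```python
def words_from_tokens(tokens):
    """Group per-character token objects (each with .text and .start_time,
    space = word boundary) into dicts shaped like the --json output:
    word, start_time, duration."""
    words, chars, word_start, last = [], [], None, None
    for tok in tokens:
        if tok.text == " ":
            if chars:  # close the current word at the boundary
                words.append({"word": "".join(chars),
                              "start_time": word_start,
                              "duration": round(last - word_start, 2)})
                chars = []
            continue
        if not chars:  # first character of a new word
            word_start = tok.start_time
        chars.append(tok.text)
        last = tok.start_time
    if chars:  # flush the final word
        words.append({"word": "".join(chars),
                      "start_time": word_start,
                      "duration": round(last - word_start, 2)})
    return words

# e.g. words = words_from_tokens(metadata.transcripts[0].tokens)
```

Printing the individual token start_times next to the waveform makes it clear whether the offset is already present at the character level or introduced by the word grouping.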