[QUESTION] wrong start_time when decoding audio file

Hello, I wanted to test DeepSpeech (I am using version 0.9.7) and see how accurate the timestamp recognition is. I created the following audio file: https://drive.google.com/file/d/16e_GzM-AgqYGs37y5fEwifX_p5AJ2moe/view?usp=sharing and realized that the results are not as expected:

deepspeech --model deepspeech-0.9.3-models.pbmm --scorer deepspeech-0.9.3-models.scorer --audio speech.wav --json

"words": [
            {
              "word": "as",
              "start_time": 0.38,
              "duration": 0.12
            },
            {
              "word": "i",
              "start_time": 0.58,
              "duration": 0.1
            },
            {
              "word": "concluded",
              "start_time": 0.76,
              "duration": 0.48
            },
            {
              "word": "my",
              "start_time": 1.32,
              "duration": 0.16
            },
            {
              "word": "term",
              "start_time": 1.6,
              "duration": 0.34
            },
            {
              "word": "as",
              "start_time": 2.06,
              "duration": 0.2
            },
            {
              "word": "the",
              "start_time": 2.32,
              "duration": 0.14
            },
            {
              "word": "forty",
              "start_time": 2.54,
              "duration": 0.3
            },
            {
              "word": "fifth",
              "start_time": 2.92,
              "duration": 0.38
            },
            {
              "word": "president",
              "start_time": 3.4,
              "duration": 0.48
            },
            {
              "word": "of",
              "start_time": 3.96,
              "duration": 0.06
            },
            {
              "word": "the",
              "start_time": 4.06,
              "duration": 0.1
            },
            {
              "word": "united",
              "start_time": 4.22,
              "duration": 0.38
            },
            {
              "word": "states",
              "start_time": 4.68,
              "duration": 0.76
            },
            {
              "word": "it",
              "start_time": 5.58,
              "duration": 0.1
            }

When I try to cut out the segment where it says it found a word, the segment doesn’t match the real word. For example, if you cut at 0.38 seconds with a duration of 0.12 seconds, you will find that the word “as” is not there; when you inspect the file with an audio editor (like Audacity), you can see that “as” occurs before the reported time.
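For reference, this is roughly how I cut the segments to check them (a minimal sketch using only the stdlib `wave` module; the file names are placeholders, and the 0.38 s / 0.12 s values are the "as" entry from the JSON above):

```python
import wave

def cut_segment(in_path, out_path, start_time, duration):
    """Copy the frames between start_time and start_time + duration
    (both in seconds) from in_path into a new WAV file at out_path."""
    with wave.open(in_path, "rb") as src:
        sr = src.getframerate()
        # Seek to the first frame of the segment, then read its frames.
        src.setpos(int(start_time * sr))
        frames = src.readframes(int(duration * sr))
        with wave.open(out_path, "wb") as dst:
            # Keep channels/width/rate identical to the source file.
            dst.setparams(src.getparams())
            dst.writeframes(frames)

# e.g. cut_segment("speech.wav", "as.wav", 0.38, 0.12)
```

Listening to the resulting clip (or opening it in Audacity) is how I confirmed the reported window does not contain the word.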

Is this normal? Can it be improved in some way?

Thank you!

I also get inaccurate token start_time values from DeepSpeech. When compared against the actual waveform in tools such as Audacity, the DeepSpeech timings are consistently late.
Here’s the code I use to generate the metadata/transcription (deepspeech version 0.9.3):

import deepspeech
from scipy.io import wavfile

model_path = "deepspeech-0.9.3-models.pbmm"
scorer_path = "deepspeech-0.9.3-models.scorer"
ds_model = deepspeech.Model(model_path)
ds_model.enableExternalScorer(scorer_path)

# The release models expect 16-bit mono PCM at 16 kHz;
# wavfile.read returns (sample_rate, int16 array).
sr, audio_signal = wavfile.read("some_audio.wav")
assert sr == ds_model.sampleRate(), "audio must match the model sample rate"
metadata = ds_model.sttWithMetadata(audio_signal)
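As far as I understand, the per-word entries in the --json output are just a grouping of DeepSpeech's per-character token metadata (`metadata.transcripts[0].tokens`, where each token has a `.text` and a `.start_time`), so you can dump the raw character timings yourself to see where the drift comes from. A minimal sketch of that grouping (my own helper, not part of the DeepSpeech API; it treats a space token as a word boundary):

```python
def words_from_tokens(tokens):
    """Group per-character token objects (each with .text and .start_time,
    space = word boundary) into dicts shaped like the --json output:
    word, start_time, duration."""
    words, chars, word_start, last = [], [], None, None
    for tok in tokens:
        if tok.text == " ":
            if chars:  # close the current word at the boundary
                words.append({"word": "".join(chars),
                              "start_time": word_start,
                              "duration": round(last - word_start, 2)})
                chars = []
            continue
        if not chars:  # first character of a new word
            word_start = tok.start_time
        chars.append(tok.text)
        last = tok.start_time
    if chars:  # flush the final word
        words.append({"word": "".join(chars),
                      "start_time": word_start,
                      "duration": round(last - word_start, 2)})
    return words

# e.g. words = words_from_tokens(metadata.transcripts[0].tokens)
```

Printing the individual token start_times next to the waveform makes it clear whether the offset is already present at the character level or introduced by the word grouping.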