Does DeepSpeech have a subtitle (SRT) output mode? How can I merge words into proper sentences?

I am trying to use DeepSpeech as subtitle software for YouTube. Unfortunately, YouTube does not generate automatic subtitles for some of my videos.

So far I have made good progress with DeepSpeech.

I have downloaded the .NET example and managed to produce both plain-text and metadata output, as shown below.

I have used this video for testing purposes: https://www.youtube.com/watch?v=oDt4ckGa-lM&ab_channel=InternationalChristianFilmFestival

I downloaded it with youtube-dl and extracted the audio as WAV with the command below:

ffmpeg -i accept.mkv -acodec pcm_s16le -ac 1 -ar 16000 accept.wav
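The DeepSpeech English models expect 16 kHz, mono, 16-bit PCM input, so it can be worth verifying that the file ffmpeg produced actually matches that before blaming the model. A minimal sanity check using only Python's standard wave module (the function name here is just an example, not part of DeepSpeech):

```python
import wave

def check_wav(path):
    """Raise if the file is not 16 kHz mono 16-bit PCM; return duration in seconds."""
    with wave.open(path, 'rb') as w:
        assert w.getframerate() == 16000, f'unexpected sample rate: {w.getframerate()}'
        assert w.getnchannels() == 1, f'expected mono, got {w.getnchannels()} channels'
        assert w.getsampwidth() == 2, f'expected 16-bit samples, got {8 * w.getsampwidth()}-bit'
        return w.getnframes() / w.getframerate()
```

For example, `check_wav('accept.wav')` should return the clip duration without raising.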

Here is the code I use. Am I getting the maximum accuracy possible?

    DeepSpeechClient.DeepSpeech deepSpeechClient =
        new DeepSpeechClient.DeepSpeech("deepspeech-0.8.2-models.pbmm");

    deepSpeechClient.EnableExternalScorer("deepspeech-0.8.2-models.scorer");

    var AudioFilePath = "lecture1.wav";

    Stopwatch watch = new Stopwatch();
    var waveBuffer = new NAudio.Wave.WaveBuffer(File.ReadAllBytes(AudioFilePath));
    using (var waveInfo = new NAudio.Wave.WaveFileReader(AudioFilePath))
    {
        watch.Start();

        string speechResult = "";

        if (blWithMeta == false)
        {
            speechResult = deepSpeechClient.SpeechToText(
                waveBuffer.ShortBuffer,
                Convert.ToUInt32(waveBuffer.MaxSize / 2));

            File.WriteAllText("without_meta.txt", speechResult);
        }

        StringBuilder srGg = new StringBuilder();
        StringBuilder srEachWord = new StringBuilder();

        if (blWithMeta == true)
        {
            DeepSpeechClient.Models.Metadata vrMeta = deepSpeechClient.SpeechToTextWithMetadata(
                waveBuffer.ShortBuffer,
                Convert.ToUInt32(waveBuffer.MaxSize / 2), 1);

            foreach (var vrPerTranscript in vrMeta.Transcripts)
            {
                var vrLastWord = "";
                var vrPrevTimeStep = 0;
                foreach (var vrPerToken in vrPerTranscript.Tokens)
                {
                    if (vrPerToken.Text == " ")
                    {
                        srGg.AppendLine($"text: {vrLastWord}\t\tstart time: {vrPerToken.StartTime}\t\ttime step: {vrPerToken.Timestep}\t\tdiff: {vrPerToken.Timestep - vrPrevTimeStep}");
                        vrPrevTimeStep = vrPerToken.Timestep;
                        srEachWord.AppendLine(vrLastWord);
                        vrLastWord = "";
                    }
                    else
                    {
                        vrLastWord = vrLastWord + vrPerToken.Text;
                    }
                }

                srGg.AppendLine("\r\n\r\n==new transcript==\r\n\r\n");
            }

            File.WriteAllText("with_meta.txt", srGg.ToString());
            File.WriteAllText("eachWord.txt", srEachWord.ToString());
        }

        watch.Stop();
        var Transcription = $"Audio duration: {waveInfo.TotalTime} {Environment.NewLine}" +
                $"Inference took: {watch.Elapsed} {Environment.NewLine}" +
                $"Recognized text length: {speechResult.Length + srGg.Length}";

        Dispatcher.BeginInvoke(new MethodInvoker(delegate ()
        {
            lstBoxResults.Items.Add(Transcription);
        }));
    }
    waveBuffer.Clear();

The procedure above generates the files below:

Each word

With meta

And simple single line

As can be seen, the metadata output only has per-word timings; I have failed to find information about how to merge them into sentences rather than individual words.

So what would be the best approach to merge the words into sentences?

So my questions are

1: Are there any other tunings I can make to improve my transcript accuracy and quality?

2: How can I turn the output into an SRT file?

3: How can I output as sentences rather than individual words?
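For reference on question 2: an SRT file is just numbered cue blocks of the form index / `start --> end` / text, separated by blank lines, with timestamps formatted as HH:MM:SS,mmm. A minimal sketch in Python (the function names and the cue data are made-up examples, not DeepSpeech API):

```python
def to_srt_time(seconds):
    # SRT timestamps look like 00:01:02,345
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3600000)
    m, rem = divmod(rem, 60000)
    s, ms = divmod(rem, 1000)
    return f'{h:02d}:{m:02d}:{s:02d},{ms:03d}'

def write_srt(cues):
    """cues: list of (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for i, (start, end, text) in enumerate(cues, start=1):
        blocks.append(f'{i}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{text}\n')
    return '\n'.join(blocks)

print(write_srt([(0.5, 2.0, 'hello world'), (2.5, 4.0, 'second caption')]))
```

Once the words are grouped into caption-sized chunks (question 3), writing them out in this shape is all SRT requires.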

@MonsterMMORPG Please read the guidelines post "What and how to report" to learn how to ask for support, and don't use screenshots.

I don't see anything actionable here to answer your question: there's no qualification of that video, like accent, kind of voice, etc.

This in itself could add artifacts that impair the recognition; please verify your resulting file.

I'm not sure I grasp how we can help here; this is orthogonal to DeepSpeech. You know how to access the decoded output / metadata, so just use it to format the output as you want.

Again, this is clearly some higher-level problem specific to your application, I really don’t see how we can be of any help.

lissyx, what I am asking is: are there any parameters that we can tune? I searched but couldn't find any. Yes, I know accent and voice matter, but that is apparently outside of the software.

I have downloaded and used the latest "deepspeech-0.8.2-models.pbmm" and "deepspeech-0.8.2-models.scorer" files. Are there any other files that could improve my accuracy?

That ffmpeg command works fine; I have tested it. Do you have any other way?

My question 2 depends on question 3. So, has anyone developed an algorithm, a method, or something to output sentences rather than individual words?

I'm not going to spend more time helping you if you can't provide any actionable feedback. I have already been patient enough. There are many things that can be tuned, it's all in the docs, but you are expecting us to do everything for you from the beginning. This is not how it works.

Without proper qualification of your data, I can't tell you whether you can quickly tune to your needs or whether the model is just not yet good enough for your use case.

Don't change the sense of the questioning here. You reached out for support, and I'm asking you to verify that your workflow is good. Again, non-audible artifacts can appear and break the recognition, and this depends on a lot of parameters. Just the ffmpeg version, and how it was built, could have an impact.

OK, thank you for the help. I will hopefully find help from some other knowledgeable experts.

So, does anyone have ideas on how to format the output as sentences?

How to tune accuracy?

You have been spamming GitHub without making the slightest effort to read even the basics of the GitHub issue template. You have been ignoring my request to follow the guidelines for reaching support.

This rude behavior and comment is not really acceptable.

Nobody will be able to help you on that matter unless you are willing to share actionable information.

Sorry for the GitHub issues. After I discovered this forum I did not post on GitHub anymore.

Also, I don't understand what you mean by actionable information. Maybe we are having a communication problem. I will wait to see if someone else understands me.

What do you not understand in:

  • read the guidelines posts, apply
  • don’t share errors/code as screenshots

?

Hi [MonsterMMORPG]

I'm trying to do the same process.
I don't know much about DeepSpeech …

Now I use Google Colab for this test:

!deepspeech --model deepspeech-0.9.3-models.pbmm --scorer deepspeech-0.9.3-models.scorer --audio test.wav --json > test.txt

# If you want to check the output JSON in a local text editor:
#from google.colab import files
#files.download('test.txt')
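For reference, the --json output of the 0.9.x client has a structure roughly like this (field names as produced by the client; the values here are made up):

```json
{
  "transcripts": [
    {
      "confidence": -18.3,
      "words": [
        {"word": "hello", "start_time": 0.52, "duration": 0.3},
        {"word": "world", "start_time": 0.9, "duration": 0.42}
      ]
    }
  ]
}
```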

Formatting from JSON to pseudo-SubRip works like this:

import json
import datetime

def fmttime(seconds):
    secs = seconds # millisecs / 1000.0
    d = datetime.timedelta(seconds=secs)
    t = (datetime.datetime.min + d).time()
    milli = t.strftime('%f')[:3]
    value = t.strftime('%H:%M:%S,') + milli
    return value 

with open('test.txt') as f:
    line = f.read()
    jso = json.loads(line)
    #print(jso['transcripts'][0]['words'])


    for i, ob in enumerate(jso['transcripts'][0]['words']):
        print()
        print(i + 1)
        #print(ob)

        # Each entry has 'word', 'start_time' (seconds) and 'duration' (seconds).
        kotoba = ob['word']
        starttime = fmttime(ob['start_time'])
        endtime = fmttime(ob['start_time'] + ob['duration'])

        print(starttime, '->', endtime)
        print(kotoba)

then the printed result is something like this:

124
00:00:50,280 -> 00:00:50,880
music

125
00:00:50,980 -> 00:00:52,080
antrustions

126
00:00:52,160 -> 00:00:52,280
get

127
00:00:52,340 -> 00:00:52,420
the

Ref.
Format seconds to time with milliseconds in Python © Darrel Herbst
https://www.darrelherbst.com/post/2016-03-05-python-format-seconds-to-time-with-milliseconds/


Now I'm just starting to think about …

3: How can I output as sentences rather than individual words?

FTR, the Esup-Pod application does use DeepSpeech to generate subtitles: https://github.com/EsupPortail/Esup-Pod/blob/master/pod/video/transcript.py You could have a look at that.


For example

deepspeech 0.9.3 --json > test.txt

import json
import datetime

def fmttime(seconds):
    secs = seconds #millisecs / 1000.0
    d = datetime.timedelta(seconds=secs)
    t = (datetime.datetime.min + d).time()
    milli = t.strftime('%f')[:3]
    value = t.strftime('%H:%M:%S,') + milli
    return value 

with open('test.txt') as f:
    line = f.read()
    jso = json.loads(line)
    #print(jso['transcripts'][0]['words'])

    totaltime = 0
    sentence = []

    endtime = ''
    starttime = ''
    lastword_time = 0
    lineNum = 1
    
    for i,ob in enumerate(jso['transcripts'][0]['words']):
        #print(ob)

        for count, key in enumerate(ob):

            if key == 'word':
                sentence.append(ob[key])
                #print(*sentence)

            if key == 'start_time':
                time = ob[key]
                if time - lastword_time >= 4:  # 4 seconds of silence
                    totaltime = 0
                    endtime = fmttime(lastword_time)
                    print(lineNum)
                    lineNum += 1
                    print(starttime, '->', endtime)
                    temp = sentence.pop()
                    # this word goes to the next caption
                    kotoba = ''
                    for word in sentence:
                        kotoba += word + ' '
                    print(kotoba)
                    print()
                    sentence.clear()
                    sentence.append(temp)
                    p_time = time
                    starttime = fmttime(p_time)
                elif len(sentence) == 1:
                    starttime = fmttime(time)
                    p_time = time

            if key == 'duration':
                totaltime += ob[key]
                lastword_time = p_time + totaltime

                #print('in :', fmttime(time), '>>', *sentence)
                #print('end :', fmttime(p_time + totaltime))

        if totaltime > 6:
            totaltime = 0
            endtime = fmttime(lastword_time)
            print(lineNum)
            lineNum += 1
            print(starttime, '->', endtime)
            kotoba = ''
            for word in sentence:
                kotoba += word + ' '
            print(kotoba)
            print()
            sentence.clear()

The above code works in this way:

in : 01:04:54,919 >> the
end : 01:04:55,000
in : 01:04:55,039 >> the beautiful
end : 01:04:55,300
in : 01:04:55,419 >> the beautiful assortment
end : 01:05:06,320

449
01:04:54,919 -> 01:05:06,320
the beautiful assortment 

then the final result is:


448
01:04:45,879 -> 01:04:54,100
on and no wonder how that happens they did it to themselves at the day 

449
01:04:54,919 -> 01:05:06,320
the beautiful assortment 

450
01:05:06,539 -> 01:05:12,620
of things produced in her the old cumberer insensate maxence 

There is room for improvement :rice_ball:

The rules for defining the subtitles here were:

If there is more than 4 seconds of silence, the subtitles will be on a new line.

Up to 6 seconds of spoken content is displayed as one subtitle.
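The two rules above can also be sketched as a standalone grouping function, assuming the word list has the shape of the 0.9.x JSON output (`word`, `start_time`, `duration`); the function name and default thresholds are illustrative, matching the 4-second-gap / 6-second-span values used in the script:

```python
def group_words(words, max_gap=4.0, max_span=6.0):
    """Group DeepSpeech word dicts into caption tuples (start, end, text).

    A new caption starts when the silence before a word exceeds max_gap,
    or when the running caption would exceed max_span seconds.
    """
    captions = []
    current = []                # words in the caption being built
    cap_start = cap_end = 0.0
    for w in words:
        start = w['start_time']
        end = start + w['duration']
        if current and (start - cap_end > max_gap or end - cap_start > max_span):
            captions.append((cap_start, cap_end, ' '.join(t['word'] for t in current)))
            current = []
        if not current:
            cap_start = start
        current.append(w)
        cap_end = end
    if current:                 # flush the last caption
        captions.append((cap_start, cap_end, ' '.join(t['word'] for t in current)))
    return captions
```

Each returned tuple can then be numbered and written out as an SRT cue.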

GitHub Gist

YouTube: testing this code

I found that

print(kotoba.rstrip())

works better.

so, the section generating the .srt output will look like this :coffee:

print(lineNum)
lineNum += 1
print(starttime, '->', endtime)
kotoba = ''
for word in sentence:
    kotoba += word + ' '
print(kotoba.rstrip())
print()