I am trying to use DeepSpeech to generate subtitles for YouTube videos, because YouTube does not generate automatic subtitles for some of my videos.
So far I have made good progress with DeepSpeech.
I downloaded the .NET example and extended it to produce both a plain-text transcript and a transcript with metadata, as shown below.
I used this video for testing: https://www.youtube.com/watch?v=oDt4ckGa-lM&ab_channel=InternationalChristianFilmFestival
I downloaded it with youtube-dl and extracted the audio as WAV with the command below:
ffmpeg -i accept.mkv -acodec pcm_s16le -ac 1 -ar 16000 accept.wav
Here is the code I use. Am I getting the maximum accuracy possible?
// Load the acoustic model and the external scorer (language model).
DeepSpeechClient.DeepSpeech deepSpeechClient =
    new DeepSpeechClient.DeepSpeech("deepspeech-0.8.2-models.pbmm");
deepSpeechClient.EnableExternalScorer("deepspeech-0.8.2-models.scorer");

var AudioFilePath = "lecture1.wav";
Stopwatch watch = new Stopwatch();
var waveBuffer = new NAudio.Wave.WaveBuffer(File.ReadAllBytes(AudioFilePath));
using (var waveInfo = new NAudio.Wave.WaveFileReader(AudioFilePath))
{
    watch.Start();
    string speechResult = "";

    // Plain transcript: a single line of text without timing information.
    if (blWithMeta == false)
    {
        speechResult = deepSpeechClient.SpeechToText(
            waveBuffer.ShortBuffer,
            Convert.ToUInt32(waveBuffer.MaxSize / 2));
        File.WriteAllText("without_meta.txt", speechResult);
    }

    StringBuilder srGg = new StringBuilder();
    StringBuilder srEachWord = new StringBuilder();

    // Transcript with metadata: tokens are single characters with timings.
    if (blWithMeta == true)
    {
        DeepSpeechClient.Models.Metadata vrMeta = deepSpeechClient.SpeechToTextWithMetadata(
            waveBuffer.ShortBuffer,
            Convert.ToUInt32(waveBuffer.MaxSize / 2), 1);

        foreach (var vrPerTranscript in vrMeta.Transcripts)
        {
            var vrLastWord = "";
            var vrPrevTimeStep = 0;
            foreach (var vrPertoken in vrPerTranscript.Tokens)
            {
                // A space token marks the end of the current word.
                if (vrPertoken.Text == " ")
                {
                    srGg.AppendLine($"text: {vrLastWord}\t\tstart time: {vrPertoken.StartTime}\t\ttime step: {vrPertoken.Timestep}\t\tdiff: {vrPertoken.Timestep - vrPrevTimeStep}");
                    vrPrevTimeStep = vrPertoken.Timestep;
                    srEachWord.AppendLine(vrLastWord);
                    vrLastWord = "";
                }
                else
                    vrLastWord = vrLastWord + vrPertoken.Text;
            }
            srGg.AppendLine("\r\n\r\n==new transcript==\r\n\r\n");
        }
        File.WriteAllText("with_meta.txt", srGg.ToString());
        File.WriteAllText("eachWord.txt", srEachWord.ToString());
    }
    watch.Stop();

    var Transcription = $"Audio duration: {waveInfo.TotalTime.ToString()} {Environment.NewLine}" +
        $"Inference took: {watch.Elapsed.ToString()} {Environment.NewLine}" +
        $"Recognized text length: {speechResult.Length + srGg.Length}";

    // Report the result on the UI thread.
    Dispatcher.BeginInvoke(new MethodInvoker(delegate ()
    {
        lstBoxResults.Items.Add(Transcription);
    }));
}
waveBuffer.Clear();
The procedure above generates the following files:
Each word (eachWord.txt)
With meta (with_meta.txt)
Simple single line (without_meta.txt)
As can be seen, the metadata output only contains timings, and I have failed to find any information about how to group the words into sentences rather than individual words.
So what would be the best approach to merge the words into sentences?
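To make this concrete: as far as I can tell the transcript contains no punctuation, so one heuristic I am considering is to split on pauses between words. Below is a minimal sketch under that assumption. `TimedWord`, `WordGrouper`, and the thresholds (0.7 seconds, 12 words) are names and values I made up for illustration and are not part of DeepSpeech. The word end time is also an assumption; it would have to be approximated, for example by the start time of the space token that follows the word.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical container for one recognized word; not a DeepSpeech type.
public sealed class TimedWord
{
    public string Text { get; }
    public double Start { get; }  // seconds
    public double End { get; }    // seconds (approximated, see note above)

    public TimedWord(string text, double start, double end)
    {
        Text = text;
        Start = start;
        End = end;
    }
}

public static class WordGrouper
{
    // Starts a new group when the silence before a word exceeds maxGapSeconds
    // or when the current group already holds maxWords words.
    public static List<List<TimedWord>> GroupByPause(
        IEnumerable<TimedWord> words, double maxGapSeconds = 0.7, int maxWords = 12)
    {
        var groups = new List<List<TimedWord>>();
        List<TimedWord> current = null;
        foreach (var word in words)
        {
            bool startNew = current == null
                || current.Count >= maxWords
                || word.Start - current[current.Count - 1].End > maxGapSeconds;
            if (startNew)
            {
                current = new List<TimedWord>();
                groups.Add(current);
            }
            current.Add(word);
        }
        return groups;
    }
}
```

Each resulting group would then become one subtitle line, with the start time of its first word and the end time of its last word. This yields pause-delimited chunks rather than true grammatical sentences.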
My questions are:
1: Are there any other tunings I can make to improve my transcript accuracy and quality?
2: How can I turn the output into an SRT file?
3: How can I output as sentences rather than individual words?
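For question 2, my understanding is that an SRT file is a sequence of numbered cues, each with a `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, the subtitle text, and a blank line. A minimal sketch of what I think the writer would look like is below. `SrtWriter` and its methods are hypothetical names of my own, and the cue start and end times are assumed to already be available in seconds.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class SrtWriter
{
    // SRT timestamps use the form HH:MM:SS,mmm (comma before the milliseconds).
    public static string FormatTimestamp(double seconds)
    {
        // Work in whole milliseconds to avoid floating-point drift.
        long totalMs = (long)Math.Round(seconds * 1000.0);
        long hours = totalMs / 3600000;
        long minutes = totalMs / 60000 % 60;
        long secs = totalMs / 1000 % 60;
        long ms = totalMs % 1000;
        return string.Format(CultureInfo.InvariantCulture,
            "{0:00}:{1:00}:{2:00},{3:000}", hours, minutes, secs, ms);
    }

    // Builds one numbered cue per (start, end, text) entry.
    public static string Build(IEnumerable<(double Start, double End, string Text)> cues)
    {
        var sb = new StringBuilder();
        int index = 1;
        foreach (var cue in cues)
        {
            sb.Append(index++).Append("\r\n");
            sb.Append(FormatTimestamp(cue.Start))
              .Append(" --> ")
              .Append(FormatTimestamp(cue.End))
              .Append("\r\n");
            sb.Append(cue.Text).Append("\r\n\r\n");
        }
        return sb.ToString();
    }
}
```

The returned string could then be saved with `File.WriteAllText("accept.srt", ...)`, where the file name is only an example.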