Transcribing longer audio files

I am trying to transcribe an audio file that is more than 30 minutes long with the following command, using deepspeech-gpu and the built-in pre-trained models:

 deepspeech --model deepspeech-0.9.3-models.pbmm --scorer deepspeech-0.9.3-models.scorer --audio "A:\untitled2.wav"

I am running this on Windows 10 with an NVIDIA RTX 2080 GPU. For shorter files, I get pretty reasonable output. For longer files, however, the application seems to run for some time (1-2 minutes) but then produces no errors or output.

Are longer files not supported with DeepSpeech? If they are, am I missing something when I try to run the command?

Other people are able to do so; without more information about your context, we can’t tell. Maybe the application consumes too much memory and gets killed?

Doesn’t seem to be the case, as I was monitoring the process in Task Manager and it consumes a reasonable amount of memory without additional involvement. What context would be helpful to diagnose this?

The last lines from the process log are:

Loaded model in 0.978s.
Loading scorer from files deepspeech-0.9.3-models.scorer
Loaded scorer in 0.01s.
Running inference.
2021-02-24 22:42:33.806422: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cublas64_10.dll

After that, there is no output.

As lissyx said, we need more context.

Training or just running inference: Inference
Mozilla STT branch/version: v0.9.3-0-gf2e9c858
OS Platform and Distribution (e.g. Linux Ubuntu 18.04): Windows 10
Python version: 3.8.6
TensorFlow version: v2.3.0-6-g23ad988fcd

It’s not even returning? I guess it’s still computing then.

I don’t think that’s the case, unfortunately. It exits to the PowerShell console again, as in, I can type another command now. For all intents and purposes, it seems that the process has completed.

Weird. Does your data fit the requirements of the model (16-bit PCM at 16 kHz)?

  1. Leave out the scorer argument to see what the acoustic model understands; maybe the words are unknown to your scorer.

  2. Test whether this is Windows related. Either store the file somewhere and post a link or start a Google Colab to test on Linux.

  3. If that doesn’t work, cut some sentences manually and feed them separately.
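The format requirement mentioned above (16-bit PCM at 16 kHz) can be checked without involving DeepSpeech at all. The following is a minimal sketch using only Python's standard `wave` module. The function names `describe_wav` and `matches_model_format` are illustrative, and the mono check is an assumption based on the released models expecting single-channel input.

```python
import wave


def describe_wav(path):
    """Return the basic format parameters of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate if rate else 0.0,
        }


def matches_model_format(info, expected_rate=16000):
    """True for 16-bit audio at the expected sample rate.

    The single-channel (mono) requirement is an assumption of this sketch.
    """
    return (
        info["channels"] == 1
        and info["sample_width_bytes"] == 2
        and info["sample_rate_hz"] == expected_rate
    )
```

Calling `matches_model_format(describe_wav(r"A:\untitled2.wav"))` on the long file and on a short file that works would show whether the two differ in format. Note that `wave.open` raises `wave.Error` for WAV files that are not plain PCM.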

I don’t know PowerShell. No weird exit code or anything?

Otherwise, have you tried our sample WAV files, which are a few seconds long, to make sure?

There is no limitation on the audio length, but the deepspeech CLI is not really intended for more than demo purposes, so it’s quite dumb. Maybe just chunking your audio into smaller parts would fix it? We also have a --stream option on the CLI to help you feed audio using the Streaming API.
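The chunking idea could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the helper names `iter_wav_chunks` and `transcribe_in_chunks` are made up for this example, the 30-second chunk length is arbitrary, and `model` is assumed to be a `deepspeech.Model` instance from the 0.9.x Python package, whose `stt()` method takes a NumPy array of 16-bit samples.

```python
import wave


def iter_wav_chunks(path, chunk_seconds=30.0):
    """Yield consecutive blocks of raw PCM bytes, each at most chunk_seconds long."""
    with wave.open(path, "rb") as wav:
        frames_per_chunk = max(1, int(wav.getframerate() * chunk_seconds))
        while True:
            data = wav.readframes(frames_per_chunk)
            if not data:
                break
            yield data


def transcribe_in_chunks(model, path, chunk_seconds=30.0):
    """Run model.stt() on each chunk and join the partial transcripts."""
    import numpy as np  # imported lazily; only needed when a model is used

    parts = []
    for data in iter_wav_chunks(path, chunk_seconds):
        # Interpret the raw bytes as 16-bit signed samples for the model.
        parts.append(model.stt(np.frombuffer(data, dtype=np.int16)))
    return " ".join(p for p in parts if p)
```

Fixed-length cuts can split a word in half at a chunk boundary, so cutting at pauses would likely give better results; this sketch keeps the simplest possible splitting to show the shape of the approach.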

@othiele thank you for the suggestions. Removing the scorer argument did not seem to have any effect. If I cut a small chunk out of the file (~2 min) and pass it to the CLI, it detects the content correctly, just not for the long (~60 min) file. It definitely could be a Windows issue; I will try on Colab later. So far it seems that things work OK for shorter files, but not for larger audio files.

@lissyx - thank you for the suggestion. For the --stream argument, are there any instructions on the usage? When I try to use it, I get an error:

deepspeech: error: unrecognized arguments: --stream

If the CLI is truly not intended to be the tool to use, I can fiddle with Python code and see if I can get a different output.
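For the Python route, a minimal sketch could look like the following. It assumes the `deepspeech` 0.9.x Python package, whose `Model` class provides `enableExternalScorer()` and `stt()`; the helper names (`load_samples`, `transcribe_to_file`, `transcribe`) and the output file path are hypothetical. Writing the transcript to a UTF-8 file instead of printing it takes the console out of the picture while diagnosing.

```python
import wave


def load_samples(path):
    """Read a 16-bit PCM WAV file into a NumPy int16 array."""
    import numpy as np  # imported lazily so the other helpers work without it

    with wave.open(path, "rb") as wav:
        return np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)


def transcribe_to_file(model, samples, out_path):
    """Run inference and write the transcript to a UTF-8 text file."""
    text = model.stt(samples)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(text + "\n")
    return text


def transcribe(model_path, scorer_path, audio_path, out_path):
    """Load the model and scorer, then transcribe one file (assumes deepspeech 0.9.x)."""
    from deepspeech import Model

    model = Model(model_path)
    model.enableExternalScorer(scorer_path)
    return transcribe_to_file(model, load_samples(audio_path), out_path)
```

A call such as `transcribe("deepspeech-0.9.3-models.pbmm", "deepspeech-0.9.3-models.scorer", r"A:\untitled2.wav", "transcript.txt")` would then leave the result in `transcript.txt`. Loading the whole file at once is only reasonable while the audio fits comfortably in memory.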

Yes, check --help; it is documented there.

If it only happens on long audio files, it sounds like a memory problem. However, I do not have much knowledge of DS inference or Windows. Maybe just check the memory consumption of your system and your GPU with Windows Task Manager while running your transcription.

It doesn’t, unfortunately. Not in the Windows build that I am using:

deepspeech --help
2021-02-26 08:16:28.300601: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
usage: deepspeech [-h] --model MODEL [--scorer SCORER] --audio AUDIO [--beam_width BEAM_WIDTH]
                  [--lm_alpha LM_ALPHA] [--lm_beta LM_BETA] [--version] [--extended] [--json]
                  [--candidate_transcripts CANDIDATE_TRANSCRIPTS] [--hot_words HOT_WORDS]

Running DeepSpeech inference.

optional arguments:
  -h, --help            show this help message and exit
  --model MODEL         Path to the model (protocol buffer binary file)
  --scorer SCORER       Path to the external scorer file
  --audio AUDIO         Path to the audio file to run (WAV format)
  --beam_width BEAM_WIDTH
                        Beam width for the CTC decoder
  --lm_alpha LM_ALPHA   Language model weight (lm_alpha). If not specified, use default from the scorer package.
  --lm_beta LM_BETA     Word insertion bonus (lm_beta). If not specified, use default from the scorer package.
  --version             Print version and exits
  --extended            Output string from extended metadata
  --json                Output json from metadata with timestamp of each word
  --candidate_transcripts CANDIDATE_TRANSCRIPTS
                        Number of candidate transcripts to include in JSON output
  --hot_words HOT_WORDS
                        Hot-words and their boosts.

@NanoNabla - I see an increase in GPU memory consumption.

RAM-wise, it’s not really that interesting either: the process consumes under 3 GB (the machine has 64 GB).

Oh, it’s the Python binary? I think we have --stream only on the C++ native client binary. Just in case, you should try downloading the native_client Windows/CPU tar.xz from our GitHub releases page and using that.

It seems that this is only the allocated size (because of the big growth in a short time interval), which means the memory is fully allocated, but there is no information about the size actually in use, which would be the interesting part.

PowerShell should have $?, like Linux, for the result of the last command: True for success and False for failure. But as I already said, I’m no Windows expert.
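The exit code can also be captured from Python, which avoids depending on shell-specific variables (in PowerShell, the numeric code of the last native program is in `$LASTEXITCODE`). The following is a minimal sketch with the standard `subprocess` module; the helper name `run_and_report` is illustrative.

```python
import subprocess


def run_and_report(cmd):
    """Run a command, capture its output, and return (exit_code, stdout, stderr)."""
    result = subprocess.run(
        cmd,
        capture_output=True,
        text=True,
        errors="replace",  # do not fail on bytes the console encoding cannot decode
    )
    return result.returncode, result.stdout, result.stderr


# Example usage with the command from the original post:
# code, out, err = run_and_report([
#     "deepspeech",
#     "--model", "deepspeech-0.9.3-models.pbmm",
#     "--scorer", "deepspeech-0.9.3-models.scorer",
#     "--audio", r"A:\untitled2.wav",
# ])
# print("exit code:", code, "| stdout characters:", len(out))
```

A non-zero exit code would suggest that the process failed rather than finished normally, while a zero code with empty standard output would point at the transcript not being written at all.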

I have the same issue on Windows. In the terminal I see:

Running inference.

Inference took 413.382s for 456.896s audio file.

If the audio is short, I see the transcript between these two lines.
So it looks like the application runs successfully; the transcript is just not printed to the console.

Using Python as well.