Running DeepSpeech on multiple .wav files to infer the text

(Megha ) #1

I am able to get output for a single audio .wav file. Below is the command I am using.

(deepspeech-venv) megha@megha-medion:~/Alu_Meg/DeepSpeech_Alug_Meg/DeepSpeech$ ./deepspeech my_exportdir/model.pb/output_graph.pb models/alphabet.txt myAudio_for_testing.wav

Here, myAudio_for_testing.wav is the audio file I am using to get the output below.

TensorFlow: v1.6.0-9-g236f83e
DeepSpeech: v0.1.1-44-gd68fde8
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2018-06-29 14:51:35.832686: I tensorflow/core/platform/] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
heritor teay we decide the lunch ha annral limined eddition of y ye com im standmat

I am also saving the output to a CSV file for now, but this works for only one audio file.
Here is my question:
I have around 2000 audio files like this. How can I read them one by one and get the output? I tried to write a Python script to read all my .wav audio files, but since deepspeech uses some sources that are kept in a virtual environment, I don't understand how to write my deepspeech command inside the script. Can you give me some hints on how to proceed? It would be a great help.

Thank you:)

(Lissyx) #2

You should just write your own script, inspired by ours:

There we do only one inference, but you could loop on more files.
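
As a rough illustration of that loop (a sketch, not the project's own client): the helper names `list_wavs`, `transcribe_with_cli` and `transcribe_dir` below are hypothetical, the code assumes Python 3, and it assumes the transcript is what the command-line client prints on stdout. The argument order of the command is copied from post #1.

```python
import csv
import os
import subprocess


def list_wavs(wav_dir):
    """Return the .wav files in wav_dir, sorted, as full paths."""
    return sorted(
        os.path.join(wav_dir, name)
        for name in os.listdir(wav_dir)
        if name.lower().endswith('.wav')
    )


def transcribe_with_cli(wav_path, binary, model, alphabet):
    """Run the command-line client once and return its stdout.

    Argument order mirrors post #1: ./deepspeech <model> <alphabet> <wav>.
    Assumption: the transcript is the only text written to stdout; if the
    version or warning lines also land there, they need to be filtered out.
    """
    out = subprocess.check_output([binary, model, alphabet, wav_path])
    return out.decode('utf-8').strip()


def transcribe_dir(wav_dir, transcribe, out_csv):
    """Call transcribe(path) for every .wav file and write one CSV row each."""
    with open(out_csv, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['file', 'text'])
        for path in list_wavs(wav_dir):
            writer.writerow([os.path.basename(path), transcribe(path)])
```

To use it with the binary, pass something like `lambda p: transcribe_with_cli(p, './deepspeech', 'my_exportdir/model.pb/output_graph.pb', 'models/alphabet.txt')` as `transcribe`. Starting the binary once per file reloads the model every time, so for around 2000 files it is probably faster to load the model once in Python and call it inside the loop. If the script imports the deepspeech package, run it with the virtual environment activated so the import resolves.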

(Abby) #3

Were you able to create your own script? Can you share it?

(Megha ) #4

Yes, I was able to do that. Below is the code. Kindly change it according to your needs. Hope it helps.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function

import argparse
import numpy as np
import shlex
import subprocess
import sys
import os
import wave
import csv
import pandas as pd

from deepspeech.model import Model, print_versions
from timeit import default_timer as timer

try:
    from shlex import quote
except ImportError:
    from pipes import quote

# These constants control the beam search decoder.
# Assumption: every value below except LM_WEIGHT is taken from the upstream
# DeepSpeech client.py of the same era; verify them against your checkout.

# Beam width used in the CTC decoder when building candidate transcriptions
BEAM_WIDTH = 500

# The alpha hyperparameter of the CTC decoder. Language Model weight
LM_WEIGHT = 1.75

# The beta hyperparameter of the CTC decoder. Word insertion weight (penalty)
WORD_COUNT_WEIGHT = 1.00

# Valid word insertion weight. This is used to lessen the word insertion penalty
# when the inserted word is part of the vocabulary
VALID_WORD_COUNT_WEIGHT = 1.00

# These constants are tied to the shape of the graph used (changing them changes
# the geometry of the first layer), so make sure you use the same constants that
# were used during training

# Number of MFCC features to use
N_FEATURES = 26

# Size of the context window used for producing timesteps in the input vector
N_CONTEXT = 9

def convert_samplerate(audio_path):
    # Resample to 16 kHz, 16-bit, mono raw PCM with SoX
    sox_cmd = 'sox {} --type raw --bits 16 --channels 1 --rate 16000 - '.format(quote(audio_path))
    try:
        output = subprocess.check_output(shlex.split(sox_cmd), stderr=subprocess.PIPE)
    except subprocess.CalledProcessError as e:
        raise RuntimeError('SoX returned non-zero status: {}'.format(e.stderr))
    except OSError as e:
        raise OSError(e.errno, 'SoX not found, use 16kHz files or install it: {}'.format(e.strerror))

    return 16000, np.frombuffer(output, np.int16)

def main():
    parser = argparse.ArgumentParser(description='Running DeepSpeech inference.')
    parser.add_argument('--model', required=True,
                        help='Path to the model (protocol buffer binary file)')
    parser.add_argument('--alphabet', required=True,
                        help='Path to the configuration file specifying the alphabet used by the network')
    parser.add_argument('--lm', nargs='?',
                        help='Path to the language model binary file')
    parser.add_argument('--trie', nargs='?',
                        help='Path to the language model trie file created with native_client/generate_trie')
    parser.add_argument('--audio', required=True,
                        help='Path to the directory of audio files to run (WAV format)')
    parser.add_argument('--version', action='store_true',
                        help='Print version and exits')
    args = parser.parse_args()

    if args.version:
        print_versions()
        return 0

    print('Loading model from file {}'.format(args.model), file=sys.stderr)
    model_load_start = timer()
    ds = Model(args.model, N_FEATURES, N_CONTEXT, args.alphabet, BEAM_WIDTH)
    model_load_end = timer() - model_load_start
    print('Loaded model in {:.3}s.'.format(model_load_end), file=sys.stderr)

    if args.lm and args.trie:
        print('Loading language model from files {} {}'.format(args.lm, args.trie), file=sys.stderr)
        lm_load_start = timer()
        ds.enableDecoderWithLM(args.alphabet, args.lm, args.trie, LM_WEIGHT,
                               WORD_COUNT_WEIGHT, VALID_WORD_COUNT_WEIGHT)
        lm_load_end = timer() - lm_load_start
        print('Loaded language model in {:.3}s.'.format(lm_load_end), file=sys.stderr)
    #-------------------------------------------------change-start ------------------------------------------------------------------------------------
    # --audio is a directory here (previously read as sys.argv[6]). It must contain
    # deepspeech_prediction.csv with a 'segmentId' column (WAV file names) and a 'text' column.
    pathToAudio = args.audio
    r = pd.read_csv(os.path.join(pathToAudio, "deepspeech_prediction.csv"))
    r['text'] = r['text'].astype(object)  # lets an empty column hold strings
    audio_files = os.listdir(pathToAudio)
    for eachfile in audio_files:
        if eachfile.endswith(".wav"):
            for i in range(0, len(r)):
                if eachfile == r['segmentId'][i]:
                    file_Path = os.path.join(pathToAudio, eachfile)
                    print("File to be read is ", file_Path)
                    fin = wave.open(file_Path, 'rb')
                    fs = fin.getframerate()
                    # Duration from the file's own rate, so it stays right after resampling
                    audio_length = fin.getnframes() / fin.getframerate()
                    if fs != 16000:
                        print('Warning: original sample rate ({}) is different than 16kHz. Resampling might produce erratic speech recognition.'.format(fs), file=sys.stderr)
                        fs, audio = convert_samplerate(file_Path)
                    else:
                        audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)
                    fin.close()
                    print('Running inference.', file=sys.stderr)
                    inference_start = timer()
                    output = ds.stt(audio, fs)
                    r.loc[i, 'text'] = output
                    inference_end = timer() - inference_start
                    print('Inference took %0.3fs for %0.3fs audio file.' % (inference_end, audio_length), file=sys.stderr)
    r.to_csv(os.path.join(pathToAudio, "deepspeech_prediction_output.csv"), sep=',')
    #-------------------------------------------------change-end ------------------------------------------------------------------------------------

if __name__ == '__main__':
    main()
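
The script reads deepspeech_prediction.csv from the audio directory and matches each .wav file name against its segmentId column, so that file has to exist before the run. In case it helps, here is a minimal sketch for generating it. The helper name `write_prediction_template` is hypothetical, the code assumes Python 3, and it assumes segmentId should hold the bare file names (which is what the comparison in the script implies).

```python
import csv
import os


def write_prediction_template(wav_dir, csv_name='deepspeech_prediction.csv'):
    """Write a CSV with one row per .wav file in wav_dir and an empty 'text' column.

    Returns the path of the CSV that was written.
    """
    names = sorted(n for n in os.listdir(wav_dir) if n.endswith('.wav'))
    csv_path = os.path.join(wav_dir, csv_name)
    with open(csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['segmentId', 'text'])
        for name in names:
            # 'text' is left empty; the inference script fills it in
            writer.writerow([name, ''])
    return csv_path
```

After calling `write_prediction_template('/path/to/wavs')`, the same directory can be passed to the script as the audio path, and the results end up in deepspeech_prediction_output.csv next to the input file.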