Real-time DeepSpeech Analysis using built-in microphone

Hello,

I am not sure how to properly contribute this knowledge on GitHub. I know the FAQ has a section noting that people would like to use DeepSpeech without having to save audio as a .wav file first.

Well, in a nutshell (and according to client.py), the Model just needs the audio source to be a flattened NumPy array. Another Python package, SpeechRecognition, has built-in support for creating an in-memory AudioData object from some audio source (microphone, .wav file, etc.).
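For example, the conversion is essentially a one-liner (a minimal sketch; audio_data here is a hypothetical speech_recognition.AudioData instance, whose frame_data attribute holds the raw 16-bit PCM bytes when the sample width is 2):

import numpy as np

# View the in-memory PCM bytes as the flattened int16 array the model expects
pcm = np.frombuffer(audio_data.frame_data, np.int16)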

Anyway, long story short, here is the code that lets me use DeepSpeech without having to create a .wav file. This assumes you have a built and trained model; for this piece of code I just used the pre-built binaries that were included.

Anyway, I hope something like this can be incorporated officially into the project.


from deepspeech import Model
import numpy as np
import speech_recognition as sr


# Inference hyperparameters
sample_rate = 16000
beam_width = 500
lm_alpha = 0.75
lm_beta = 1.85
n_features = 26
n_context = 9

model_name = "output_graph.pbmm"
alphabet = "alphabet.txt"
language_model = "lm.binary"
trie = "trie"
audio_file = "demo.wav"  # only used by the commented-out .wav example below


if __name__ == '__main__':
    # Load the pre-trained model and enable the external language model decoder
    ds = Model(model_name, n_features, n_context, alphabet, beam_width)
    ds.enableDecoderWithLM(alphabet, language_model, trie, lm_alpha, lm_beta)

    r = sr.Recognizer()
    with sr.Microphone(sample_rate=sample_rate) as source:
        print("Say Something")
        audio = r.listen(source)
        fs = audio.sample_rate
        # AudioData.frame_data holds raw 16-bit PCM, so it can be viewed
        # directly as the flattened int16 array DeepSpeech expects
        audio = np.frombuffer(audio.frame_data, np.int16)

    # Alternative: read the audio from a .wav file instead of the
    # microphone (requires `import wave`)
    # fin = wave.open(audio_file, 'rb')
    # fs = fin.getframerate()
    # print("Framerate: ", fs)
    # audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)
    # audio_length = fin.getnframes() * (1 / sample_rate)
    # fin.close()

    print("Infering {} file".format(audio_file))

    print(ds.stt(audio, fs))

Thanks for sharing that @duys. We already have some similar contributed examples in https://github.com/mozilla/DeepSpeech/tree/master/examples/, so maybe you can get some inspiration from those and/or send a PR to add yours?

@duys
Hey there!
I can help you get your PR up. The thing is, though, I think the existing streaming examples already work well. Could you point out why this might be better, or why someone would use this example over any of the others?
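For context, the existing examples drive the streaming API, feeding audio in chunks instead of one big buffer. Roughly like this (a rough sketch assuming the same 0.5-era bindings as your code; audio_chunks stands in for whatever source yields 16-bit PCM blocks):

sctx = ds.setupStream(sample_rate=16000)
for chunk in audio_chunks:
    ds.feedAudioContent(sctx, np.frombuffer(chunk, np.int16))
print(ds.finishStream(sctx))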

As a beginner myself, I think this example is a good starting point. The existing examples go into much greater detail, which can make it confusing to see what exactly is happening.