DeepSpeech pre-trained model behaviour changes when inferencing on mobile with a streaming mic vs. a batch file

Thank you for providing such a wonderful open-source library, DeepSpeech.
I was testing DeepSpeech 0.7.4 on Android with the pre-trained TFLite model and the scorer.
When I try to do inference with the mic it gives different transcribed results, while when processing a recorded audio file (a test sample from the clean LibriSpeech dataset) it gives good transcription results.

Also, when doing free-form inference over the mic, I tried some basic tests with simple sentences and the model misinterprets them almost every time compared to the audio sample.
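For reference, the batch path I tested looks roughly like this (a simplified sketch, assuming a 16 kHz, 16-bit mono PCM WAV with a plain 44-byte header; the package and method names are from the 0.7.x Java bindings as I understand them):

```kotlin
import org.mozilla.deepspeech.libdeepspeech.DeepSpeechModel
import java.io.File
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Batch inference on a WAV file (simplified: assumes a 16 kHz, 16-bit,
// mono PCM WAV with a plain 44-byte header and no extra chunks).
fun transcribeWav(modelPath: String, scorerPath: String, wavPath: String): String {
    val model = DeepSpeechModel(modelPath)
    model.enableExternalScorer(scorerPath)

    // Skip the 44-byte header and read the samples as little-endian shorts.
    val bytes = File(wavPath).readBytes()
    val pcm = ShortArray((bytes.size - 44) / 2)
    ByteBuffer.wrap(bytes, 44, bytes.size - 44)
        .order(ByteOrder.LITTLE_ENDIAN)
        .asShortBuffer()
        .get(pcm)

    val text = model.stt(pcm, pcm.size)
    model.freeModel()
    return text
}
```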

There could be so many variables in play; there’s nothing we can do unless you share more specifics of your setup.

@lissyx is right, but DeepSpeech is also not very good with accents. E.g. if you compare clean US data with my German accent, there are big differences :slight_smile:

@lissyx I tried the example of mic-streaming inference:
android-mic-streaming
and batch-audio-processing

The result is that when I pass the audio (test.wav) to batch processing it works pretty well and gives good results, but when testing with the streaming-mic example it is not able to properly recognize basic sentences.
For mic-streaming, the model used in the code was DeepSpeech 0.7.0;
I have changed it to the latest one, 0.7.4. Does that affect it, or is some other factor affecting this?
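In case it matters, here is roughly how the mic capture and streaming decode are wired up in my test (simplified from the android-mic-streaming example as I understand it; the key point is that the mic is captured as 16-bit mono PCM at the model’s sample rate):

```kotlin
import android.media.AudioFormat
import android.media.AudioRecord
import android.media.MediaRecorder
import org.mozilla.deepspeech.libdeepspeech.DeepSpeechModel

// Streaming inference from the mic (simplified; requires the RECORD_AUDIO
// permission). The capture format must match what the model expects:
// 16-bit mono PCM at model.sampleRate(), i.e. 16 kHz for the released models.
fun streamFromMic(model: DeepSpeechModel, keepRecording: () -> Boolean): String {
    val buffer = ShortArray(2048)
    val minBuf = AudioRecord.getMinBufferSize(
        model.sampleRate(),
        AudioFormat.CHANNEL_IN_MONO,
        AudioFormat.ENCODING_PCM_16BIT
    )
    val recorder = AudioRecord(
        MediaRecorder.AudioSource.VOICE_RECOGNITION,
        model.sampleRate(),
        AudioFormat.CHANNEL_IN_MONO,
        AudioFormat.ENCODING_PCM_16BIT,
        maxOf(minBuf, buffer.size * 2)   // buffer size is in bytes
    )
    val stream = model.createStream()
    recorder.startRecording()
    while (keepRecording()) {
        val read = recorder.read(buffer, 0, buffer.size)
        if (read > 0) model.feedAudioContent(stream, buffer, read)
    }
    recorder.stop()
    recorder.release()
    return model.finishStream(stream)    // final transcript
}
```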

Well, as you can see, one is in the deepspeech repo and the other is in the deepspeech-examples repo. The deepspeech-examples repo is there to help people share their examples, but we can’t guarantee it is perfect.

Also, please understand that streaming from the microphone in itself has a lot of variability: background noise, volume, etc.
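A quick sanity check is to log a rough RMS level for each mic buffer before feeding it to the model, to see whether the capture is very quiet or clipping (just a sketch; how you interpret the values is up to you):

```kotlin
import kotlin.math.sqrt

// Rough level check for a mic buffer: values near 0 suggest the capture is
// too quiet, values approaching 1.0 suggest clipping.
fun rmsLevel(buffer: ShortArray, read: Int): Double {
    if (read <= 0) return 0.0
    var sum = 0.0
    for (i in 0 until read) {
        val s = buffer[i] / 32768.0
        sum += s * s
    }
    return sqrt(sum / read)
}
```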

So far, you still have not documented a lot of your process. @othiele mentioned accents, for example …

@lissyx @othiele
I have run a few experiments with the DeepSpeech TFLite model:
As per the documentation, the model is biased towards a US accent and works best with little noise and in a clean environment.
My use case is to run a DeepSpeech model on a mobile device, so I used the TFLite version of DeepSpeech.

  • I tried to do inference by playing a US-accent audio clip on my laptop and running the inference on a mobile device in a silent room. The result was around 0.5-0.7 WER and varied every time I ran it; for this I thought it was the noise, or the fact that it is not the raw audio, that was affecting the model.
  • I tried the same audio that I play on my laptop with the DeepSpeech command-line tool, passing the audio, scorer and model, and it predicts very well.
  • Third, when speaking slowly (fewer words per minute) the accuracy improves, but when speaking in a fast tone the accuracy drops.
    What could be the sources of this accuracy drop, and if I have to use DeepSpeech for everyday conversational speech, do I have to re-train the model?

Also, when I read out content from the LibriSpeech dataset, it is recognized correctly through the mic, but with general sentences the accuracy drops. There is also a problem with a smooth start when inferencing with the mic: only after speaking a few words do the words begin to be predicted correctly.
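To try to separate the mic capture from the decoding, my next test is to feed the exact same captured chunks to both the streaming API and a single stt() call, so streaming and batch decoding can be compared on identical audio (a rough sketch; compareStreamingAndBatch is my own helper):

```kotlin
import org.mozilla.deepspeech.libdeepspeech.DeepSpeechModel

// Decode the same captured chunks twice: once through the streaming API and
// once as a single batch call, to compare the two paths on identical audio.
fun compareStreamingAndBatch(model: DeepSpeechModel, chunks: List<ShortArray>): Pair<String, String> {
    val stream = model.createStream()
    for (chunk in chunks) {
        model.feedAudioContent(stream, chunk, chunk.size)
    }
    val streamed = model.finishStream(stream)

    // Concatenate the same chunks and decode them in one batch call.
    val all = ShortArray(chunks.sumOf { it.size })
    var pos = 0
    for (chunk in chunks) {
        chunk.copyInto(all, pos)
        pos += chunk.size
    }
    val batch = model.stt(all, all.size)
    return streamed to batch
}
```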