Android demo app poor inference results

Hi,

I am testing DeepSpeech on Android:

  • using the androidspeech demo app (latest version, i.e. commit 0ca061bdd8c5c849265937d3364eb79afcf47eef)
  • on a Huawei P20 Lite (the SoC is a Kirin 659: octa-core, 4x2.36 GHz Cortex-A53 & 4x1.7 GHz Cortex-A53)

I haven’t changed the demo app code in any way. The models were downloaded from here (v0.6.0). I have also trained my own Polish models (n_hidden = 1700), with the same poor performance. The output transcript is not totally random gibberish, i.e. it contains real words, but it is obviously wrong. The same model (.pbmm, not .tflite) on a PC running Ubuntu yields very good results (at least in my opinion). As far as I know, models converted to .tflite are expected to perform somewhat worse, but in my case the results are not even comparable.
Anyway, I don’t think this is a model-specific problem, so the details below refer to the DeepSpeech pretrained English models. Some examples:

  • me saying: one two three four, and the transcription is what a to (audio)
  • me saying: my name is Robert, and the transcription is nature of the (audio)
  • me saying: speak to me, and the transcription is sixty (audio)
  • me saying: i like to play chess, and the transcription is i getta (audio)

I am not a native English speaker, but I don’t think my accent is bad enough to expect such results. Also, the linked audio files are not the raw files saved by the androidspeech demo app; I recorded them with the default recorder app on my phone to demonstrate the way I talk and the testing environment. To save the audio buffer directly from the demo app, I used the keepClips option in STTLocalClient.java. Here is an example of such a clip converted to a regular wav file (not raw PCM): audio. In this recording I am saying: i like to play chess. I don’t know why the audio quality is that bad. Maybe there is an issue with dropped frames (my phone being too slow)? The P20 Lite isn’t the newest or most performant phone, but it is comparable to an RPi3, which achieves inference times below realtime.
Is there any way to prevent frame drops?

Edit: this is the command I used to convert the raw PCM:
ffmpeg -f s16le -ar 16k -ac 1 -i iliketoplaychess_DS.wav iliketoplaychess_DS_not_raw.wav

Accent can have a very bad impact

Hard to diagnose your model there.

Unfortunately, I have not had time to update to 0.7 release.

Please compare your model with TFLite, otherwise it’s not really useful.

Ok, can you explain then how you reproduce? Do you feed them into the android app? How?

If you say the audio quality is bad, it’s very very very likely to be linked to your issue.

The RPi3 does not achieve better-than-realtime performance; the RPi4 does when you use the language model.

This android app should be performing inference on a different thread, so you should not lose frames. However, we have not tested on your kind of device, so I can’t guarantee you are not dropping any.

Are you running ARMv7 or ARM64?

The expected WER drop is from 8% to 10% for your models; in the case of my Polish model, the WER on a ~2k-hour dataset is 12%, while the .tflite model’s WER on Android is almost always 100%.

The linked audio files are for demonstration purposes only (my accent, the audio quality, the noise level). For transcription I used the demo app as is; as far as I know, it uses the DS streaming API. I was saying the same phrases and trying to say them in a similar manner. I also said them multiple times (louder, from a distance), but always with bad results. I know this is not a very scientific approach, but I think there is no point in a proper comparison at this level of accuracy, given the contrast between the .pbmm and the .tflite exported from the same checkpoint.

My point was that the audio quality on my device is fine (see the linked audio recorded with it), but the raw audio saved by the demo app sounds shredded. If I understand correctly, the saved clips are just concatenated buffers, the same ones that are fed to DS. Hence my assumption of frame dropping.

Sorry, my bad. Either way, the Kirin 659 is 8-core (4x2.36 GHz + 4x1.7 GHz) and the RPi4 is 4x1.5 GHz. On paper they seem comparable, but I don’t know much about hardware.

ARM64.

How do you compare that? Please make sure you test that outside of the Android app.

They should work.

Sorry, but I can’t listen to your audio.

Right, so you add variance in not being able to replay the exact same audio.

Of course, but you could eliminate a source of uncertainty by recording and running inference outside of the app using the tflite runtime.

At least, what you say seems to make sense.

So ARM64 should be pretty fast.

Pay attention here: you said some Cortex-A53, and the RPi4 is Cortex-A72. That’s a non-trivial difference.

I’m not completely sure this is right: the dumping code should produce a ready-to-listen wav file with no conversion needed; ffplay -f s16le -ar 16000 -ac 1 should do the trick.

Thank you for the help :slight_smile:

I’ll do that.
Again, thanks.

Today I made some progress. First of all, I ran an evaluation with the tflite runtime on a subset of the test dataset containing ~10h of data (evaluation_tflite.py) and got:

Totally 10436 wav entries found in csv

10436
Totally 10436 wav file transcripted
Test - WER: 0.125653, CER: 0.040187, loss: 0.000000
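
For anyone wanting to run the same kind of check, the invocation is along these lines (an illustrative sketch only; the exact flag names depend on the DeepSpeech version, and the model/scorer/csv file names here are placeholders):

    python evaluation_tflite.py --model output_graph.tflite --scorer kenlm.scorer --csv test_subset.csv

i.e. the script transcribes every wav listed in the csv with the .tflite model and reports the aggregate WER/CER shown above.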

The same model wasn’t working with the androidspeech demo app, so I switched to the demo app from this repo (v0.6; Polish), and I also checked v0.7 of the same app with the DS pretrained English models. Same behavior - unexpectedly poor performance.

I’ve managed to solve my problem (or so I think) with some code changes in the demo app. Feeding into DS and recording were happening on the same background thread. I split it into two background threads: one only for recording and adding audio buffers to a concurrent queue, and a second only for polling the queue and feeding into DS. I don’t know why the demo apps work fine on other devices, but from my experience it seems that feeding into DS (model.feedAudioContent) is blocking, which leaves a gap between consecutive reads of audio buffers by the recorder.
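
Roughly, the split looks like this (a simplified sketch, not my exact code; recorder, model, streamContext, isRecording, transcription and audioBufferSize are the objects/values from the demo app, and the queue capacity is arbitrary):

    import java.util.concurrent.ArrayBlockingQueue
    import java.util.concurrent.TimeUnit

    // bounded queue between the recording thread (producer) and the inference thread (consumer)
    val audioQueue = ArrayBlockingQueue<ShortArray>(64)

    // thread 1: only read from AudioRecord and enqueue a copy of each buffer
    val recordingThread = Thread {
        val buffer = ShortArray(audioBufferSize)
        while (isRecording.get()) {
            val read = recorder.read(buffer, 0, buffer.size)
            if (read > 0) {
                audioQueue.put(buffer.copyOf(read)) // blocks only if the queue is full
            }
        }
    }

    // thread 2: only poll the queue and feed DeepSpeech
    val inferenceThread = Thread {
        while (isRecording.get() || audioQueue.isNotEmpty()) {
            val chunk = audioQueue.poll(100, TimeUnit.MILLISECONDS) ?: continue
            model.feedAudioContent(streamContext, chunk, chunk.size)
            val decoded = model.intermediateDecode(streamContext)
            runOnUiThread { transcription.text = decoded }
        }
    }

    recordingThread.start()
    inferenceThread.start()

With the bounded queue, if inference briefly can’t keep up the recording thread just waits a moment instead of AudioRecord silently overwriting its internal buffer, which is what I suspect was shredding the saved clips.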

If you find this interesting I can elaborate more.

Well, it’s true that it is blocking, and that’s why there are threads, as I said.

I can’t speak for the android_mic_streaming demo app.

Which demo app ? androidspeech or android_mic_streaming ?

The androidspeech one should already properly be threaded.

Ok, this is a refactoring I had no chance to verify (lack of time and huge changes).

Could you please check with https://github.com/mozilla/androidspeech/commit/896dd460b3fadda1d153411c9deceecf6a5d9f25 ?

I’ve changed android_mic_streaming

I’ve tested this version and it is working fine! :slightly_smiling_face:
I think the breaking change was introduced in this commit https://github.com/mozilla/androidspeech/commit/0ca061bdd8c5c849265937d3364eb79afcf47eef#diff-73d197e414ade96ab6ac314d08b7d3f9 (mozillaspeechlibrary/src/main/java/com/mozilla/speechlibrary/recognition/SpeechRecognition.java):

    nshorts = mRecorder.read(mBuftemp, 0, mBuftemp.length);
and
    mStt.encode(mBuftemp, 0, nshorts);
are being called on the same thread in the start method.

Please file an issue on the repo then, and share the link with us here.

Thanks for reporting the issue @madziszyn. I’ve posted a patch that should fix it: https://github.com/mozilla/androidspeech/pull/38

You can try the PR if you want or wait until it lands and we push v2.0.2 to the maven repository.

Any testing or bug reporting is greatly appreciated!

The patch has been merged into master. Also v2.0.2 has been pushed to maven.

Let us know if it’s working as expected.

Thank you!
I’ll try to test it ASAP and report back.

Hi @madziszyn,
sorry for going a bit off-topic, but would it be possible for you to share a pre-trained Polish model? I’d like to use it for offline speech recognition on Android. The only model I’ve found is the one published by Jaco (https://gitlab.com/Jaco-Assistant/Scribosermo), and it does work, but I’m looking for something possibly better (that one is marked as one of the “Old experiments”).
Best!

Coqui has a model zoo that will be updated. Currently they have the same Polish model.

Hi all, after reading this, the problem seems very similar to an issue I’m seeing with DeepSpeech (very poor results, even though I definitely fall within the model’s bias).

My code is based on: android_mic_streaming

I just wanted to confirm this has the same design issue described in this thread, as the audio reading and feeding are done on the same thread:

        while (isRecording.get()) {
            // read one buffer from the AudioRecord
            recorder.read(audioData, 0, audioBufferSize)
            // feed it to DeepSpeech and refresh the intermediate transcription
            model.feedAudioContent(streamContext, audioData, audioData.size)
            val decoded = model.intermediateDecode(streamContext)
            runOnUiThread { transcription.text = decoded }
        }

Is it true that I should be separating recorder.read and model.feedAudioContent onto different threads?

UPDATE:
I can confirm that moving the recorder read and the feedAudioContent call onto separate threads vastly improved the recognition. I’ll open an issue on the GitHub repo for that example; I’d open a PR, but I’m not sure of the best way to approach it from the Java side (as I’m coding in Xamarin/.NET).