I am testing DeepSpeech on Android:
- using the androidspeech demo app (latest version, i.e. commit 0ca061bdd8c5c849265937d3364eb79afcf47eef)
- on a Huawei P20 Lite (Kirin 659 SoC; octa-core: 4x2.36 GHz Cortex-A53 & 4x1.7 GHz Cortex-A53)
I haven’t changed the demo app code in any way. The models were downloaded from here (v0.6.0). I’ve also trained my own models for Polish (n_hidden = 1700), and they show the same poor performance. The output transcript is not totally random gibberish, i.e. it contains real words, but it is obviously wrong. The same model (.pbmm, not .tflite) on PC + Ubuntu yields very good results (at least in my opinion). As far as I know, models converted to .tflite are supposed to perform somewhat worse, but in my situation the results are not even comparable.
Anyway, I don’t think this is a model-specific problem, so the details below apply to the pretrained English DeepSpeech models. Here are some examples:
- me saying: one two three four, and the transcription is: what a to (audio)
- me saying: my name is Robert, and the transcription is: nature of the (audio)
- me saying: speak to me, and the transcription is: sixty (audio)
- me saying: i like to play chess, and the transcription is: i getta (audio)
I am not a native English speaker, but I don’t think my accent is bad enough to explain such results. Note that the linked audio files are not the raw files saved by the androidspeech demo app. I recorded them with the default recorder app on my phone to demonstrate the way I talk and the testing environment. To save the audio buffer directly from the demo app, I used the keepClips option in the STTLocalClient.java file. Here is an example of such a clip, converted to a regular wav file (not raw PCM): audio. In this recording I am saying: i like to play chess. I don’t know why the audio quality is that bad. Maybe there are issues with dropped frames (because my phone is too slow)? The P20 Lite isn’t the newest or fastest phone, but it is comparable to an RPi3, which achieves faster-than-realtime inference.
Is there any way to prevent frame drops?
Edit: this is the command I used to convert the raw PCM:
ffmpeg -f s16le -ar 16k -ac 1 -i iliketoplaychess_DS.wav iliketoplaychess_DS_not_raw.wav
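For reference, the same conversion can be sketched in Python with the standard-library wave module. This is only a rough sketch under the assumptions already baked into the ffmpeg command (16-bit little-endian samples, 16 kHz, mono); the helper names pcm_duration_seconds and write_wav are made up for illustration and are not part of the demo app or DeepSpeech. It also computes the clip length from the byte count, which may help with the dropped-frames question: if the saved clip turns out noticeably shorter than the time actually spent speaking, that would suggest samples were lost before they reached the buffer.

```python
import wave

SAMPLE_RATE = 16000   # Hz, same as "-ar 16k" (assumed capture rate)
SAMPLE_WIDTH = 2      # bytes per sample, same as "-f s16le"
CHANNELS = 1          # mono, same as "-ac 1"


def pcm_duration_seconds(pcm: bytes) -> float:
    """Return how many seconds of audio the raw buffer actually holds."""
    return len(pcm) / (SAMPLE_RATE * SAMPLE_WIDTH * CHANNELS)


def write_wav(pcm: bytes, path: str) -> None:
    """Wrap headerless PCM in a WAV container without resampling."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(pcm)
```

To use it, read the clip saved by keepClips in binary mode (for example iliketoplaychess_DS.wav from the command above), pass the bytes to pcm_duration_seconds to see how long the clip really is, and then call write_wav to get a playable file.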