Feasibility of DeepSpeech on mobile devices

I’m creating a mobile game in Unity and am hoping to process speech recognition on the device. Due to some other technical limitations (voice chat required for multiplayer) I am unable to use the built-in iOS and Android APIs for this.

I was hoping someone could point me in the right direction. My requirements are:

  • Support both iOS and Android
  • Pass raw audio data snippets up to ~5s long (e.g. array of floats) to DeepSpeech running on the device and convert to text.
  • Recognize a predetermined list of phrases (around 50) in various languages (user tells us which language they are speaking ahead of time)
  • Use the bare minimum amount of resources. The game is already quite demanding so I would need this to function in a performant manner.

I’d also be interested in hiring someone to help me with this. If that is you, please reach out to me directly rather than in this thread; I didn’t want this to be a job posting.

Native libs exist for both (here and here). They may not be as mature as DeepSpeech (DS) for C++, but they are getting there.

That length should be no problem.
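One thing to watch when passing float arrays: as far as I know, the released DeepSpeech models take 16-bit signed PCM, mono, at 16 kHz, not floats. Here is a minimal sketch of the conversion in Python, assuming the floats are already mono, in the range [-1.0, 1.0], and at the model's sample rate (resampling is not covered). The names `floats_to_pcm16` and `SAMPLE_RATE_HZ` are made up for illustration; the same few lines port easily to C#.

```python
from array import array

# Assumption: the model expects 16 kHz mono input.
SAMPLE_RATE_HZ = 16000


def floats_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to signed 16-bit PCM samples."""
    pcm = array("h")  # "h" = signed short
    for s in samples:
        # Clamp first so out-of-range values cannot overflow int16.
        clamped = max(-1.0, min(1.0, s))
        pcm.append(int(round(clamped * 32767)))
    return pcm


# A silent 5-second snippet, just to show the size involved.
snippet = [0.0] * (5 * SAMPLE_RATE_HZ)
pcm = floats_to_pcm16(snippet)
print(len(pcm), "samples")
```

The resulting buffer is what you would hand to the speech-to-text call of whichever binding you end up using.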

You would need a custom language model for that.
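A language model is built from a plain text corpus, so with about 50 phrases the starting point is simply one normalized phrase per line, one corpus per language. Below is a rough sketch of that preparation step only. The function `build_corpus` and the example phrases are hypothetical, and the exact tooling that turns the corpus into a language model depends on your DeepSpeech version, so it is left out here.

```python
import re


def build_corpus(phrases):
    """Normalize phrases to one lowercase line each, dropping duplicates."""
    seen = set()
    lines = []
    for phrase in phrases:
        # Lowercase, replace punctuation (except apostrophes) with spaces,
        # then collapse runs of whitespace.
        text = re.sub(r"[^\w\s']", " ", phrase.lower())
        text = " ".join(text.split())
        if text and text not in seen:
            seen.add(text)
            lines.append(text)
    return lines


# Hypothetical phrase lists; the user picks the language ahead of time,
# so each language gets its own corpus (and later its own language model).
phrases_by_language = {
    "en": ["Open the door!", "open  the door", "Don't move."],
    "de": ["Öffne die Tür!"],
}
corpora = {lang: build_corpus(p) for lang, p in phrases_by_language.items()}
```

Each list in `corpora` can then be written to a text file and fed to the language model tooling.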

Resource usage depends on the tflite model (ca. 80K) and the language model, which should be small in your case. Speed could be an issue; you’ll have to test.

I wrote you a message, but if anyone else is interested, please write to Joe. We need more STT in our lives :)

I guess that drastically reducing the beam width could help with the speed, and building a small custom language model is no problem for a programmer. However, I currently do not know how to turn off quantization (or whether it is worth the effort). I am also not sure about the possibility of multiple languages.
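A narrow beam may give slightly garbled transcripts. Since only a fixed list of phrases is valid anyway, one cheap safety net is to snap the output to the closest known phrase. To be clear, this is a post-processing idea on top of the recognizer, not a DeepSpeech feature. The sketch below uses Python's standard `difflib`; `snap_to_phrase` and the `cutoff` value of 0.6 are my own assumptions and would need tuning.

```python
import difflib


def snap_to_phrase(transcript, phrases, cutoff=0.6):
    """Return the known phrase closest to the transcript, or None."""
    matches = difflib.get_close_matches(
        transcript.lower().strip(), phrases, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


# Hypothetical phrase list for one language.
known = ["open the door", "close the door", "jump"]
print(snap_to_phrase("open the dor", known))
print(snap_to_phrase("xyzzy", known))
```

Returning `None` for poor matches lets the game ignore input that is not close to any phrase instead of guessing.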
