Decoupling of language model?

Hello, I’m currently putting together a transcription project where I’m interested in executing the transcription model on a client. However, it looks like the lm.binary file, which is needed to get good results, is usually rather large and maybe not ideal to transfer onto a client. So, I’ve been brainstorming a system where the acoustic model is executed on the client and the language model is executed on a server. Ideally, it wouldn’t require an active connection between the client and server, i.e., the client fully processes the audio with the acoustic model, passes the data to the server, the server runs it through the language model, and finally returns a transcription with metadata.

From my tests with the Python bindings, I’ve seen that the acoustic model can be executed without passing in a language model. This makes me hopeful that a decoupled system like this is possible. The issue I’m imagining is when to pass data from the client to the server, as I’m not sure how actively the language model is polled in the decoder. I’m currently looking through the ./native_client/ctcdecode/ files to get a better understanding of the decoding process, but does anyone have any insight?

I found an enhancement thread that seems semi-related, but it’s scheduled for the 2.0.0 release: https://github.com/mozilla/DeepSpeech/issues/1678

Very actively. The easiest way to replicate the decoder we have would be to send all of the logits to the server and run the entire decoder there.
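
A rough sketch of what that could look like, assuming you can export the per-timestep logits from the acoustic model as a (time_steps, num_classes) float array on the client. The endpoint URL and the `decode_with_lm()` helper below are placeholders, not DeepSpeech APIs; server-side you’d plug in the same CTC beam search + KenLM scoring path the native client uses.

```python
import numpy as np
import requests


def send_logits(logits: np.ndarray, url: str = "http://localhost:8000/decode") -> str:
    """Client side: serialize the acoustic model output and let the server decode it."""
    payload = {
        "shape": list(logits.shape),
        "dtype": str(logits.dtype),
        # Plain JSON is fine for a prototype; switch to a binary format for real traffic.
        "data": logits.ravel().tolist(),
    }
    response = requests.post(url, json=payload, timeout=30)
    response.raise_for_status()
    return response.json()["transcript"]


def decode_with_lm(logits: np.ndarray) -> str:
    # Placeholder for the server-side decoder: in practice this would call the
    # CTC beam search + language model scoring that the native client uses
    # (the ds_ctcdecoder package), whose exact API differs between releases.
    raise NotImplementedError


def handle_decode(payload: dict) -> str:
    """Server side (e.g. inside a Flask/FastAPI handler): rebuild the array and decode."""
    logits = np.array(payload["data"], dtype=payload["dtype"]).reshape(payload["shape"])
    return decode_with_lm(logits)
```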

The other option, which would not produce results equivalent to the current system, would be to compute candidate transcriptions on the client and send them to the server, which can then do processing on just the list of candidate predictions: for example, spelling correction or rescoring.
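
If you go the candidate-list route, the server-side rescoring can be as simple as querying a KenLM model over the hypotheses. A minimal sketch, assuming the `kenlm` Python module, an lm.binary on the server, and that the client also ships a per-candidate acoustic score; the weighting is illustrative, not the decoder’s exact maths.

```python
import kenlm

# Assumed path to the server-side language model.
lm = kenlm.Model("lm.binary")


def rescore(candidates: list[str], acoustic_scores: list[float], lm_weight: float = 0.75) -> str:
    """Pick the candidate with the best combined acoustic + LM score."""
    def combined(text: str, acoustic: float) -> float:
        # kenlm.Model.score() returns a log10 probability; higher is better.
        return acoustic + lm_weight * lm.score(text, bos=True, eos=True)

    best_text, _ = max(zip(candidates, acoustic_scores), key=lambda pair: combined(*pair))
    return best_text


# e.g. rescore(["recognize speech", "wreck a nice beach"], [-4.1, -3.9])
```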

The primary reason for doing everything client-side is privacy, so if you’re doing some of it server-side anyway, you may as well do it all server-side.

One option would be to create a smaller LM from a smaller dataset, which wouldn’t require re-training the acoustic model. This would have worse accuracy than a larger LM though.
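
A sketch of that approach, assuming the KenLM tools (lmplz, build_binary) are on your PATH and corpus.txt holds one normalized sentence per line. The order and pruning settings are just a starting point, and depending on your DeepSpeech version you’ll still need to generate the matching trie/scorer package afterwards.

```python
import subprocess


def build_small_lm(corpus: str = "corpus.txt",
                   arpa: str = "lm.arpa",
                   binary: str = "lm.binary",
                   order: int = 3) -> None:
    """Build a reduced-size KenLM binary from a domain-specific corpus."""
    with open(corpus, "rb") as text, open(arpa, "wb") as out:
        # Lower order + pruning keeps the model small; tune both against your test set.
        subprocess.run(["lmplz", "-o", str(order), "--prune", "0", "0", "1"],
                       stdin=text, stdout=out, check=True)
    subprocess.run(["build_binary", arpa, binary], check=True)


build_small_lm()
```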

From what I’ve found, it looks like I could fill the MFCC buffer on the client, send it to the server, and then call StreamingState::processBatch() in deepspeech.cc until completion, correct? It looks like the logits are created within that function, though, so is there a later function I could call instead?

Is there an option to produce multiple candidate transcriptions? I’ve been considering implementing what’s basically a server-based autocorrect but have been turned off by only being able to obtain a single string of text.

This might be the solution. I’ve been able to build relatively small language models from my corpus that perform well in my tests, so I might just tar up the tflite model with my own lm.binary and trie and call it a day.

The 0.7 alphas include this feature, but there won’t be a pre-trained model available until the full 0.7 release.
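
For reference, a sketch against the 0.7-alpha Python API (names as of that release line, so they may still shift before the full 0.7 release); the model, scorer, and audio paths are placeholders. `sttWithMetadata()` can return several candidate transcripts, which is what you’d feed a server-side autocorrect or rescoring step.

```python
import wave

import numpy as np
from deepspeech import Model

ds = Model("output_graph.tflite")
ds.enableExternalScorer("kenlm.scorer")  # optional; omit for acoustic-model-only output

with wave.open("audio.wav", "rb") as wav:
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

# Ask the decoder for several candidate transcripts instead of a single string.
metadata = ds.sttWithMetadata(audio, num_results=5)
for candidate in metadata.transcripts:
    text = "".join(token.text for token in candidate.tokens)
    print(f"{candidate.confidence:.2f}  {text}")
```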
