Creating a tflite model and lm for command recognition

I am trying to use DeepSpeech to perform speech recognition on my Android device. Since the default model is too big and a bit slow, I wanted to optimize the model for my specific use-case, so I tried to build my own tflite model, lm.binary and trie. I have a very limited vocabulary (around 30 words). I am trying to understand how DeepSpeech works in practice and have a couple of questions related to this.

While training the model:
My training data consists only of single-word utterances, i.e. each .wav file contains exactly one word.

  1. How do I decide how many hidden layers are sufficient for my use-case? I tried 150 and 250 layers, which gave me more or less the same accuracy.
  2. How many epochs should I run to deliberately overfit the model to these specific 30 words? I tried 30 epochs with a learning rate of 0.0001, but the WER doesn’t go below 0.51. If I reduce the learning rate further, will I be able to reach a WER of 0.1 or less?
  3. How many utterances of each word do I need to train such a model? Presently, I have around 400-500 utterances of each word and I am using a train-dev-test split of 65-20-15, scripted as in the sketch below. Is this sufficient for me? How do I know how many utterances are needed for different use cases?
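
For what it’s worth, the split itself is easy to script. A minimal sketch, assuming DeepSpeech-style CSVs (wav_filename, wav_filesize, transcript) and a hypothetical all_utterances.csv; a per-word stratified split would additionally guarantee that every word appears in all three sets:

```python
import csv
import random

# Assumes DeepSpeech-style CSV rows: wav_filename, wav_filesize, transcript.
with open("all_utterances.csv") as f:
    rows = list(csv.DictReader(f))

random.seed(42)   # reproducible split
random.shuffle(rows)

n = len(rows)
splits = {
    "train.csv": rows[:int(0.65 * n)],
    "dev.csv":   rows[int(0.65 * n):int(0.85 * n)],
    "test.csv":  rows[int(0.85 * n):],
}

for name, subset in splits.items():
    with open(name, "w", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["wav_filename", "wav_filesize", "transcript"])
        writer.writeheader()
        writer.writerows(subset)
```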

While building the lm file:

  1. What corpus should I use to generate the language model? I first tried a corpus that simply listed all 30 words, one per line. With that LM, the model always returned a single word, even when the .wav file contained multiple words. I then rebuilt the LM from a corpus in which multiple words (from my 30-word dictionary) occur on the same line, but it still recognized only single words. The same LM file returned multiple words (working fine and as expected) when I used the default tflite model of DeepSpeech 0.5.1. Should I also structure the training data so that multiple words occur in each .wav? My build steps looked roughly like the sketch below.
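
A minimal sketch of such a corpus-plus-LM build, assuming KenLM’s lmplz/build_binary and DeepSpeech 0.5.1’s generate_trie are on the PATH; the vocabulary, line count and file names here are placeholders:

```python
import random
import subprocess

# Stand-in for the real 30-word command vocabulary.
vocab = ["open", "close", "start", "stop"]  # extend to the full word list

# Emit multi-word lines so the n-gram LM sees word-to-word transitions;
# a corpus of isolated words gives the decoder no evidence that one
# command may follow another.
with open("corpus.txt", "w") as f:
    for _ in range(10000):
        f.write(" ".join(random.choices(vocab, k=random.randint(1, 5))) + "\n")

# KenLM and DeepSpeech 0.5.1 native tools; paths are placeholders.
# --discount_fallback lets lmplz cope with a very small vocabulary.
subprocess.run(["lmplz", "-o", "5", "--discount_fallback",
                "--text", "corpus.txt", "--arpa", "lm.arpa"], check=True)
subprocess.run(["build_binary", "-a", "255", "-q", "8", "trie",
                "lm.arpa", "lm.binary"], check=True)
subprocess.run(["generate_trie", "alphabet.txt", "lm.binary", "trie"],
               check=True)
```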

Please elaborate? Can you give more context on your hardware and software?

You are tackling the problem the wrong way; changing the geometry of the model that deeply will have endless consequences.

You should first try to explain what you are doing. Building a command-specific LM and using the generic TFLite model works very well.

I’m sorry, I forgot to mention this. The hardware I am working on is an Android device with a 1.7 GHz dual-core processor, running API level 21. Yes, using the generic TFLite model with a command-specific LM works very well, with great accuracy, but on such a low-end device inference takes more than 4 seconds for a 2-second utterance. Therefore, I wanted to reduce the number of hidden layers to shrink the model and, hence, the inference time.
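
For reference, a minimal sketch of how such a timing can be measured with the DeepSpeech 0.5.1 Python bindings; the constants are the defaults from the upstream example client, and the file names are placeholders:

```python
import time
import wave

import numpy as np
from deepspeech import Model

# Constants taken from the DeepSpeech 0.5.1 example client.
N_FEATURES, N_CONTEXT, BEAM_WIDTH = 26, 9, 500
LM_ALPHA, LM_BETA = 0.75, 1.85

# Placeholder paths; on the device itself, a .tflite model is loaded
# through the Android bindings instead.
ds = Model("output_graph.pbmm", N_FEATURES, N_CONTEXT, "alphabet.txt",
           BEAM_WIDTH)
ds.enableDecoderWithLM("alphabet.txt", "lm.binary", "trie",
                       LM_ALPHA, LM_BETA)

with wave.open("utterance.wav", "rb") as w:
    rate = w.getframerate()
    audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)

start = time.time()
text = ds.stt(audio, rate)
elapsed = time.time() - start
print(f"'{text}' decoded in {elapsed:.2f}s "
      f"for {len(audio) / rate:.2f}s of audio")
```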

Ok, can you share the SoC name? That feels a bit old, right? I’m curious about the context.

You need to change the model width, indeed: n_hidden. But then you will have to re-train from scratch; you cannot re-use our checkpoints.
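
Something along these lines, with placeholder paths and values; the flag names follow the 0.5.1 trainer, so double-check them against util/flags.py in your checkout:

```python
import subprocess

# Hypothetical from-scratch retraining run with a narrower model.
subprocess.run([
    "python", "DeepSpeech.py",
    "--train_files", "train.csv",
    "--dev_files", "dev.csv",
    "--test_files", "test.csv",
    "--alphabet_config_path", "alphabet.txt",
    "--lm_binary_path", "lm.binary",
    "--lm_trie_path", "trie",
    "--n_hidden", "512",          # narrower than the default 2048
    "--epoch", "50",
    "--learning_rate", "0.0001",
    "--export_dir", "export/",
    "--export_tflite",            # also emit a .tflite graph
    "--checkpoint_dir", "fresh_checkpoints/",  # no reuse of release checkpoints
], check=True)
```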

Yes. Please find the specification sheet of the device: https://www.zebra.com/content/dam/zebra_new_ia/en-us/solutions-verticals/product/Mobile_Computers/Hand-Held%20Computers/Symbol%20TC70%20Touch%20Computer/spec%20sheet/tc70series-product-specification-sheet-en-us.pdf

I am tied to these old devices because we already have many of them deployed in small stores, and we want to use the existing hardware to convert speech to text.

Yes, totally agreed; that’s what I am trying to do. I re-trained the model from scratch but got a WER of 0.51 after 50 epochs with a learning rate of 0.00001. Before moving further, I wanted to clarify a few things, like how to decide the number of hidden layers (the n_hidden parameter) I should use, or whether reducing the learning rate further can help increase the accuracy.

I am using the dataset from here: https://ai.googleblog.com/2017/08/launching-speech-commands-dataset.html

That heavily depends on your dataset. Reducing model complexity will impact WER for sure, and we don’t have enough data to provide guidance on the impact, sorry.

Note that --n_hidden controls the width of the layers, not the number of layers.
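
To make that concrete, here is a back-of-the-envelope sketch; the layer constants are my own rough approximation of a DeepSpeech-style stack (26 MFCC features × 19 context frames in, 29 CTC labels out), not the exact 0.5.1 graph:

```python
def rough_param_count(n_hidden, n_input=494, n_labels=29):
    """Very rough parameter estimate for a DeepSpeech-style stack
    (dense layers plus one LSTM, all n_hidden wide); the constants
    are illustrative, not the exact 0.5.1 graph."""
    dense_in  = n_input * n_hidden               # input projection
    dense_mid = 2 * n_hidden * n_hidden          # two hidden dense layers
    lstm      = 4 * (2 * n_hidden) * n_hidden    # 4 gates, input + recurrent
    dense_out = n_hidden * n_hidden + n_hidden * n_labels
    return dense_in + dense_mid + lstm + dense_out

for n in (2048, 512, 256):
    print(f"n_hidden={n}: ~{rough_param_count(n):,} parameters")
```

The quadratic terms dominate, which is why halving n_hidden cuts the model size by roughly a factor of four rather than two.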

Thanks a lot @reuben for pointing out this mistake. I didn’t realise this before.