Creating a tflite model and lm for command recognition

I am trying to use DeepSpeech to perform speech recognition on my Android device. Since the default model is too big and a bit slow, I wanted to optimize the model for my specific use case, so I tried to build my own tflite model, lm.binary and trie. I have a very limited vocabulary (around 30 words). I am trying to understand how DeepSpeech works in practice and have a few questions about it.

While training the model:
My training data consists only of single-word utterances, i.e. each .wav file contains exactly one word, spoken once.

  1. How do I decide how many hidden layers are sufficient for my use-case? I tried 150 and 250 layers which gave me more or less the same accuracy.
  2. How many epochs should I run to more or less overfit my model to these specific 30 words? I tried 30 epochs with a learning rate of 0.0001, but the WER doesn’t go below 0.51. If I reduce the learning rate further, will I be able to reach a WER of 0.1 or less?
  3. How many utterances of each word do I need to train such a model? Presently, I have around 400-500 utterances of each word and I am using a train-dev-test split of 65-20-15. Is this sufficient for me? How do I know how many utterances are needed for different use cases?
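The 65-20-15 split mentioned above can be done per word, so that every word appears in all three sets. The following is only a minimal sketch: the function name, the `files_by_word` mapping and the file names are hypothetical and not part of DeepSpeech. Note that a purely random split can put recordings from the same speaker into both the training and test sets, which tends to make test results look better than they are.

```python
import random

def split_utterances(files_by_word, percents=(65, 20, 15), seed=0):
    """Split each word's files into train/dev/test by integer percentages.

    files_by_word: hypothetical mapping {word: [wav paths]}.
    Returns {"train": [...], "dev": [...], "test": [...]} of (path, word) pairs.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    splits = {"train": [], "dev": [], "test": []}
    for word, files in sorted(files_by_word.items()):
        files = sorted(files)
        rng.shuffle(files)
        n_train = len(files) * percents[0] // 100
        n_dev = len(files) * percents[1] // 100
        splits["train"] += [(f, word) for f in files[:n_train]]
        splits["dev"] += [(f, word) for f in files[n_train:n_train + n_dev]]
        # Whatever remains after train and dev goes to test.
        splits["test"] += [(f, word) for f in files[n_train + n_dev:]]
    return splits

# Hypothetical example: two words with 400 recordings each.
example = {
    "yes": [f"yes_{i}.wav" for i in range(400)],
    "no": [f"no_{i}.wav" for i in range(400)],
}
example_splits = split_utterances(example)
```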

While building the lm file:

  1. What corpus should I use to generate the language model? I first tried a corpus that was just the list of all 30 words, one per line. With that, the model always output a single word, even when the .wav file contained multiple words. Then I rebuilt the lm with a corpus in which multiple words (from my 30-word dictionary) occur on the same line. Even then, it only recognized single words. The same lm file produced multiple words (working as expected) when I used the default tflite model of DeepSpeech 0.5.1. Should the training data also contain multiple words in each .wav?
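A corpus with several command words per line, as described above, can be generated rather than written by hand. This is a minimal sketch under assumptions: `build_corpus` and the three example words are hypothetical stand-ins, and the thread does not confirm that this kind of corpus fixes the single-word output. The fact that the same lm worked with the default model suggests the language model may not be the limiting factor.

```python
import itertools

def build_corpus(words, max_len=3):
    """Return one line per word sequence of length 1..max_len.

    Every ordering is included, so the language model sees each word
    both on its own and next to every other word.
    """
    lines = []
    for n in range(1, max_len + 1):
        for combo in itertools.product(words, repeat=n):
            lines.append(" ".join(combo))
    return lines

# Hypothetical stand-in for the 30-word command vocabulary.
words = ["left", "right", "stop"]
corpus = build_corpus(words, max_len=2)

# Writing the lines to a text file (one sentence per line) gives the
# plain-text input that language-model tools typically expect:
# with open("corpus.txt", "w") as f:
#     f.write("\n".join(corpus) + "\n")
```

The corpus grows quickly with `max_len`: for 30 words and sequences of up to three words it has 30 + 900 + 27000 lines.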

Could you please elaborate? Can you give more context on your hardware and software?

You are tackling the problem the wrong way: changing the geometry of the model that deeply will have endless consequences.

You should first try to explain what you are doing. Building a command-specific LM and using the generic TFLite model works very well.

I’m sorry, I forgot to mention this. The hardware I am working on is an Android device with a 1.7 GHz dual-core processor, running API level 21. Yes, using the generic TFLite model with a command-specific LM works very well, with great accuracy, but on such a low-end device inference takes more than 4 seconds for a 2-second utterance. Therefore, I wanted to reduce the number of hidden layers to shrink the model and hence the inference time.
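The timing above can be expressed as a real-time factor (inference time divided by audio duration), which makes it easier to compare model sizes on the same device. The helper below is just a sketch of that arithmetic; the function name is hypothetical.

```python
def real_time_factor(inference_seconds, audio_seconds):
    """Ratio of processing time to audio length; below 1.0 is faster than real time."""
    return inference_seconds / audio_seconds

# Figures from the post: more than 4 s of inference for a 2 s utterance.
rtf = real_time_factor(4.0, 2.0)
print(rtf)  # 2.0, i.e. at least twice as slow as real time
```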

OK, can you share the SoC name? That sounds a bit old, right? I’m curious about the context.

You do indeed need to change the model width, n_hidden. But then you will have to re-train from scratch; you cannot re-use our checkpoints.

Yes. Please find the specification sheet of the device here: https://www.zebra.com/content/dam/zebra_new_ia/en-us/solutions-verticals/product/Mobile_Computers/Hand-Held%20Computers/Symbol%20TC70%20Touch%20Computer/spec%20sheet/tc70series-product-specification-sheet-en-us.pdf

I have to rely on these old devices because many of them are already in use in small stores, and we want to use the existing devices for speech-to-text.

Yes, totally agreed; that’s what I am trying to do. I re-trained the model from scratch, but got a WER of 0.51 after 50 epochs with a learning rate of 0.00001. Before moving further, I wanted to clarify a few things, such as how to decide the number of hidden layers (the n_hidden parameter) to use, and whether reducing the learning rate further would help improve accuracy.

I am using the dataset from here: Launching the Speech Commands Dataset
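For reference, WER is the number of word-level edits (substitutions, insertions, deletions) divided by the number of reference words. With single-word utterances, a WER of about 0.5 therefore means that roughly one clip in two is transcribed wrongly. The sketch below uses this generic definition; it is not DeepSpeech's own evaluation code, and the example pairs are made up.

```python
def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance between two transcripts."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]

def corpus_wer(pairs):
    """Total word errors divided by total reference words."""
    errors = sum(word_errors(ref, hyp) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    return errors / words

# Made-up (reference, hypothesis) pairs: two correct, one substitution,
# one empty output.
pairs = [("yes", "yes"), ("no", "go"), ("stop", "stop"), ("left", "")]
print(corpus_wer(pairs))  # 0.5
```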

That heavily depends on your dataset. Reducing model complexity will impact WER for sure, and we don’t have enough data to provide guidance on the impact, sorry.

Note that --n_hidden controls the width of the layers, not the number of layers.
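To illustrate the difference between width and depth, the sketch below counts trainable parameters for a fixed stack of layers while only the width changes. The layer layout (three dense layers, one LSTM, one dense layer, one output layer) and the default input and output sizes are assumptions made for illustration, not values taken from this thread, so the absolute numbers should not be read as the real model size.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def lstm_params(n_in, n_cell):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((n_in + n_cell) * n_cell + n_cell)

def count_params(n_hidden, n_input=494, n_output=29):
    """Parameter count for an ASSUMED layout; the layer count never changes.

    n_input and n_output are hypothetical defaults for this sketch.
    """
    total = dense_params(n_input, n_hidden)        # first dense layer
    total += 2 * dense_params(n_hidden, n_hidden)  # two more dense layers
    total += lstm_params(n_hidden, n_hidden)       # recurrent layer
    total += dense_params(n_hidden, n_hidden)      # dense layer after the LSTM
    total += dense_params(n_hidden, n_output)      # output layer
    return total

# Same number of layers in each case; only the width differs.
for width in (150, 250, 2048):
    print(width, count_params(width))
```

Because most terms grow with the square of the width, halving the width cuts the parameter count by much more than half, while the number of layers stays the same.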

Thanks a lot @reuben for pointing out this mistake. I didn’t realise this before.