Language Model Hyperparameter Optimisation

After mentioning this a while ago, I finally got around to building a (crude) hyperparameter optimiser for the language model parameters lm_alpha and lm_beta.

The approach I took involved running DeepSpeech/evaluate.py once per iteration, which means the GPU has to re-run acoustic-model inference on the recordings every time.

Ideally, you would run inference once, and leave the CPU to run the language model with different settings on the pre-calculated character logits from the GPU inference stage. I just have to figure out how to do that, and thought it would make sense to write the ‘brain’ first. In our case, the acoustic model only takes a few seconds to run inference, so the cost of doing things this way is low.

But I digress. The optimiser I built requires the following:

  • A range of values for lm_alpha (e.g. 0.1 to 3)
  • A range of values for lm_beta (e.g. 0.1 to 4)
  • A temperature value (e.g. 0.01, 0.1 or even 1 or 3)
  • A max number of inference runs to perform
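The inputs above can be sketched as a simple configuration object. This is just an illustration of the four knobs described; the key names are hypothetical, not the optimiser's actual interface.

```python
# Hypothetical configuration for the optimiser described above.
# All names are illustrative; only the four quantities themselves
# come from the write-up.
config = {
    "lm_alpha_range": (0.1, 3.0),   # search range for lm_alpha
    "lm_beta_range": (0.1, 4.0),    # search range for lm_beta
    "temperature": 0.1,             # exploration vs. exploitation
    "max_runs": 30,                 # cap on inference runs
}
```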

When the temperature is very high, the proposed points are chosen from the range uniformly at random. When the temperature is very low, the model proposes points that are as close to the surface minimum as possible.
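A minimal sketch of that temperature behaviour, assuming the proposal step is a softmax over negated word error rates (the function name and exact scaling are my guesses, not the author's code):

```python
import numpy as np

def sample_index(wers, temperature):
    """Pick a candidate index with probability softmax(-WER / T).

    Very high T -> choices are close to uniform (exploration);
    very low T -> mass concentrates on the lowest observed WER.
    """
    scores = -np.asarray(wers, dtype=float) / temperature
    scores -= scores.max()          # shift for numerical stability
    probs = np.exp(scores)
    probs /= probs.sum()
    return np.random.choice(len(wers), p=probs)
```

At a temperature near zero the softmax collapses onto the minimum, which matches the "as close to the surface minimum as possible" behaviour described above.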

In my context, running inference on the test set takes about 30 seconds, because our test set doesn’t have much audio. But for other people I’m guessing this could take up to an hour or more. This affected the design I chose, but I’m interested to hear how well it might work for other people.

How it works:

First, we sample 5 points in the given range of lm_alpha and lm_beta values uniformly at random.

[image: scatter of the 5 initial randomly sampled points]

The idea is to get a good scatter of possible values, so we can propose better points later on.
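The initial sampling step is straightforward; here is a sketch of one way to do it (function name and the use of a seeded NumPy generator are my choices, not necessarily the original code):

```python
import numpy as np

def initial_points(alpha_range, beta_range, n=5, seed=0):
    """Sample n (lm_alpha, lm_beta) pairs uniformly at random
    from the given ranges, returned as an (n, 2) array."""
    rng = np.random.default_rng(seed)
    alphas = rng.uniform(*alpha_range, size=n)
    betas = rng.uniform(*beta_range, size=n)
    return np.column_stack([alphas, betas])
```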

Then we fit a quadratic surface to the existing points, and use a softmax over the negated predicted word error rates to choose the next point, so that lower points on the surface are proposed with higher probability. We want to search out the regions with low word error rate, but the model isn’t perfect (even with lots of points) so we don’t want to search too aggressively either.
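Fitting a quadratic surface can be done with ordinary least squares over the six quadratic basis terms. This is a sketch under that assumption; the original implementation may differ.

```python
import numpy as np

def fit_quadratic(points, wers):
    """Least-squares fit of
    wer ≈ c0 + c1*a + c2*b + c3*a^2 + c4*a*b + c5*b^2
    where a = lm_alpha, b = lm_beta."""
    a, b = points[:, 0], points[:, 1]
    X = np.column_stack([np.ones_like(a), a, b, a**2, a * b, b**2])
    coeffs, *_ = np.linalg.lstsq(X, wers, rcond=None)
    return coeffs

def predict(coeffs, a, b):
    """Evaluate the fitted surface at a single (a, b) point."""
    return coeffs @ np.array([1.0, a, b, a**2, a * b, b**2])
```

With at least six well-spread points the fit is determined; with more, least squares averages out noise in the measured word error rates.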

[image: quadratic surface fitted to the sampled points]

As more points are added, the surface starts to look somewhere between an elliptic paraboloid and a parabolic cylinder.

[image: fitted surface after more iterations]

The plot below gives the best word error rate achieved after n iterations (x axis). We can see that the minimum was reached after about 10 iterations. I’d have to do more tests to see how stable this result is, but it feels representative.

[image: best word error rate vs. number of iterations]

You could implement a rule where if the model doesn’t beat its current best word error rate after 10 or so iterations, it can stop early.
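That stopping rule could look something like this; the function name and the exact "patience" framing are my additions, not part of the original optimiser.

```python
def should_stop(wer_history, patience=10):
    """Stop if the best WER has not improved in the last
    `patience` inference runs. `wer_history` is the WER
    measured at each iteration, in order."""
    if len(wer_history) <= patience:
        return False
    best_before = min(wer_history[:-patience])
    recent_best = min(wer_history[-patience:])
    return recent_best >= best_before
```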

For completeness, this is what the scatter plot looks like once it is more filled in. You can use it to spot reasonable values for lm_alpha and lm_beta if you have nothing to go on, although your mileage may vary.

[image: filled-in scatter of sampled (lm_alpha, lm_beta) points]

Happy to answer follow up questions.


This is awesome, thanks for the write up! To compute the acoustic model predictions only once, you could follow the same approach as evaluate.py, which does that for a single test set. If you close the TF Session afterwards, the GPUs should be released.

Any plans to share the code?


I do want to send this in a PR once we have the GPU running inference once. At the moment, we’re pressing hard against deadlines so my available time to spend on this bit is uncertain for now.

I’ve already spent a while thinking about it, as you can see, and we’re still looking to take actions that improve the model rather than fiddling too much. I thought that rather than waiting until it’s perfect, it’s better to share something now that others might find helpful.
