How to run inference repeatedly with different tf.FLAGS

Hi there,

TLDR: We want to run the evaluate.py script multiple times, setting different values for the language model hyperparameters lm_alpha and lm_beta. Since we’re using Python to do the driving, we want to run evaluate, switch the tf.FLAGS as needed, and run it again.

It’s a simplification, but picture we’re doing something like this:

graph = create_inference_graph(batch_size=FLAGS.test_batch_size, n_steps=-1)
FLAGS.lm_alpha = 0.1   # first run with one LM weight
samples = evaluate(test_data, graph)
FLAGS.lm_alpha = 4     # second run, same graph, different LM weight
samples = evaluate(test_data, graph)

Just doing this, we get the following error:

F tensorflow/core/framework/tensor.cc:793] Check failed: limit <= dim0_size (1058 vs. 1057)

which I think is because we’re trying to run a computation graph that we have already initialised and run. I’m not really sure what to do with this error.

Failing this, the other way to do it is to use Python to drive evaluate.py entirely from the command line. This is fine, but it requires a lot of file manipulation and a temp directory, so it wasn’t the first thing I considered trying.
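
For what it’s worth, here’s a minimal sketch of that external-driver idea, assuming lm_alpha and lm_beta are accepted as command-line flags by evaluate.py (they come through tf.FLAGS, so I expect they are) and that parse_wer is a helper you write yourself to pull the WER out of whatever report your version prints:

import subprocess

def run_evaluate(lm_alpha, lm_beta):
    # Launch evaluate.py in a fresh process so every run gets a clean graph.
    # --test_files / --checkpoint_dir are placeholders for your own paths.
    result = subprocess.run(
        ["python", "evaluate.py",
         "--test_files", "data/dev.csv",
         "--checkpoint_dir", "checkpoints/",
         "--lm_alpha", str(lm_alpha),
         "--lm_beta", str(lm_beta)],
        stdout=subprocess.PIPE, universal_newlines=True, check=True)
    # parse_wer is a hypothetical helper: extract the WER from the text
    # evaluate.py prints, in whatever format your version uses.
    return parse_wer(result.stdout)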

Anyway, I hope someone has something useful to suggest.


Added context:

In our experiments, manual hyperparameter optimisation of lm_alpha and lm_beta made a difference of about 2-3% to our validation error, so I’m working towards automating the process.

I’ve put together a fairly straightforward strategy following some experiments in which I learned the following:

  • The word error rate appears to be a convex function of lm_alpha, for a fixed lm_beta and beam_width. This suggests that a simple binary search will converge just fine.
    [plot: word error rate vs lm_alpha]
  • For sensible values of lm_beta (from, say, 0.1 to 4), the word error rate increases somewhat linearly in lm_beta, but there is a lot of noise, so finding a good value is not so simple.
    [plot: word error rate vs lm_beta]
  • However, as the names suggest, lm_alpha has a more significant effect on the word error rate than lm_beta does.
  • Unsurprisingly, increasing beam_width also decreases the word error rate, but with exponentially diminishing returns. From what we saw, the default of 1024 is probably a good balance in the tradeoff between accuracy and inference time. 2048 did do better, but only in the second or lower decimal place (e.g. 11.32% vs 11.36%).
    [plot: word error rate vs beam_width]

So the strategy is to:

  • Fix the beam_width at 1024
  • Fix beta at 1
  • Do some kind of search on lm_alpha, assuming the best value lies somewhere between 0.1 and 4 (this will probably require ~5 evaluations)
  • Fix lm_alpha at the best value, and do a grid search over lm_beta from 0.1 to 3, with a grid width of about 0.1 (about 30 evaluations); a rough sketch of the whole loop follows below
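
To make the plan concrete, here’s a rough sketch of the two-stage loop, assuming a run_evaluate(lm_alpha, lm_beta) helper like the subprocess one above that returns the WER of a single evaluation. I’ve used a ternary search as the “some kind of search” over lm_alpha, since it only needs the WER to be unimodal in lm_alpha; note that each iteration costs two evaluations, so it’s a bit more than the ~5 runs estimated above.

def tune_lm_weights(run_evaluate):
    # Stage 1: narrow down lm_alpha on [0.1, 4] with lm_beta fixed at 1.
    # Ternary search relies on the WER being (roughly) convex in lm_alpha;
    # each step shrinks the interval by a third at the cost of two runs.
    lo, hi = 0.1, 4.0
    while hi - lo > 0.2:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if run_evaluate(m1, 1.0) < run_evaluate(m2, 1.0):
            hi = m2
        else:
            lo = m1
    best_alpha = round((lo + hi) / 2.0, 2)

    # Stage 2: fix lm_alpha at the best value and grid-search lm_beta
    # from 0.1 to 3.0 in steps of 0.1 (30 evaluations), keeping the
    # value with the lowest WER.
    betas = [round(0.1 * i, 1) for i in range(1, 31)]
    best_beta = min(betas, key=lambda b: run_evaluate(best_alpha, b))
    return best_alpha, best_beta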

For our (admittedly small) evaluation set, each run of evaluate.py takes about 30 seconds.


Actually, now that I think about it, I probably want to drive evaluate.py from the outside in case you guys change evaluate.py again in 0.5.0 or later.

I’m guessing people might still find this discussion (esp. the graphs etc) useful, so I figure I should leave it here. If someone knows what I should have done to solve the error, I still would like to know.

First of all, thanks for reporting your experiments with hyperparam optimization. The behavior of lm_alpha and lm_beta in particular matches what I’ve seen when optimizing our release values, and it’s nice to see it’s consistent across validation sets.

As for driving hyperparam search in evaluate.py, the best way would be to compute acoustic model predictions once, and then decode several times. The decode step does not depend on TensorFlow so you don’t have to worry at all about sessions/graphs/etc. You get a bit of a speedup since you don’t have to re-run the audio through the acoustic model every time, and you can also close the TF session to release the GPUs for other usage while the (CPU-only) decoding is running.

Currently, it’s not really possible to do that externally, calling into evaluate.py from a separate script. The body of the evaluate function needs to be refactored into two smaller functions: one that computes logits from the test/dev data, and one that decodes the logits with given parameters. These changes would then allow you to easily decode in a loop to do hyperparam search.
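
To illustrate the shape of that refactor, something like the split below is what I mean. None of this is the actual evaluate.py code: the tensor names, the make_feed_dict helper and the decode_with_lm call are placeholders for what the real script does internally (the last one standing in for the native CTC beam search decoder).

def compute_logits(session, graph, test_data):
    # Run the test/dev set through the acoustic model once and keep the
    # raw logits and sequence lengths in memory (or dump them to disk).
    return session.run(
        [graph['logits'], graph['seq_lengths']],
        feed_dict=make_feed_dict(graph, test_data))  # hypothetical helper

def decode_logits(logits, seq_lengths, lm_alpha, lm_beta):
    # CPU-only step: beam search decode the cached logits with the given
    # LM weights; decode_with_lm stands in for the native decoder call.
    return decode_with_lm(logits, seq_lengths, lm_alpha, lm_beta)

# Hyperparam search then becomes a loop over the cheap step only:
# logits, lengths = compute_logits(session, graph, test_data)
# session.close()  # frees the GPU while decoding runs on the CPU
# for alpha, beta in candidate_params:
#     transcripts = decode_logits(logits, lengths, alpha, beta)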

I would take a PR that does any part of this. It’s something we need to land anyway (in the past I’ve simply modified the code directly to add a loop around the decode part), so I’ll get to it eventually, but patches are always welcome.


Hey Reuben,

I’m glad you were able to corroborate our results, I was hoping someone would.

I think I can put the PR together for this feature. It’ll take a bit of work, but it’ll be worth it. I’ll raise an issue on GitHub to spark discussion while I work on the feature. There are a few design considerations that it’ll be good to get opinions on.