Questions about the lm_optimizer.py process?

Hello again,

I am working on creating a functional scorer for my data. I wanted to get some clarification on a few things.

  1. I create the scorer before I train my model, correct? The scorer will aid in the training process as part of the loss function (ctc_decoder)? I am fine with using the generate_lm.py script to do this. I noticed that in DeepSpeech 0.7.4 generate_package.py has moved to a .cpp file, but the documentation doesn’t reflect this. To get around this I have cloned the repo on the 0.7.3 branch to get back the old generate_package.py functionality, and I am able to execute it without problems (rough commands below).

It would be nice to see some documentation on using the C++ version if this is the route DeepSpeech is going with for the scorer.
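For reference, here is roughly what I run on the 0.7.3 branch to build the scorer. Paths are placeholders and the flag values are just the ones from the release docs, so treat this as a sketch rather than my exact invocation:

    # Build the n-gram LM (lm.binary + vocab file) with KenLM
    python data/lm/generate_lm.py \
        --input_txt /path/to/corpus.txt \
        --output_dir /path/to/lm_out \
        --top_k 500000 \
        --kenlm_bins /path/to/kenlm/build/bin \
        --arpa_order 5 \
        --max_arpa_memory "85%" \
        --arpa_prune "0|0|1" \
        --binary_a_bits 255 \
        --binary_q_bits 8 \
        --binary_type trie

    # Package lm.binary + vocab into a .scorer (0.7.3 Python script);
    # the default alpha/beta here are just the release-doc defaults and
    # can be tuned later with lm_optimizer.py
    python data/lm/generate_package.py \
        --alphabet /path/to/alphabet.txt \
        --lm /path/to/lm_out/lm.binary \
        --vocab /path/to/lm_out/vocab-500000.txt \
        --package /path/to/kenlm.scorer \
        --default_alpha 0.931289039105002 \
        --default_beta 1.1834137581510284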

  2. The documentation on lm_optimizer.py is a bit sparse. It seems to share the same flags as DeepSpeech.py. From what I understand, I generate the .scorer file as above, then use something like:

    python lm_optimizer.py --test_files /path/to/dev.csv --scorer_path /path/to/the-scorer-generated-above.scorer

Just to clarify, I should do all of this before I start training my model, correct? The scorer optimization should also be done on my dev.csv files, not my test files, so as not to bias the results, right?

After completing this, I should re-make my .scorer file with the suggested alpha and beta values.

  3. I was running lm_optimizer.py as above and it seemed to be working, but I am getting the good old SWIG memory leak error:

I’m not sure if this is a big deal, as it doesn’t halt execution; I just want to confirm that no issues are present. The scorer ran for a single epoch, gave some output, then continued running on another “test epoch”. I guess I am a bit confused about what this scorer is doing: since it isn’t a network, how is it “training” on the data?

  4. Ideally, after optimizing the .scorer, I should be good to go with training and can point to this new scorer when I run DeepSpeech.py?

Thank you,

No, the scorer has nothing to do with training. Please read some of the posts about custom language models to understand that better; building the right one can be tricky.

No. Also, for future reference: the scorer, lm_alpha and lm_beta have nothing to do with training (train/dev), only with testing. The scorer is like a big dictionary, and lm_alpha and lm_beta are parameters that control how the n-grams in that dictionary are searched given what you get as output from the neural net. lm_optimizer.py will let you find the best parameters for a given dev set and scorer.
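In other words, something like this (placeholder paths; lm_optimizer.py shares its flags with DeepSpeech.py, so the usual shared flags such as the checkpoint directory apply as well, see further down):

    # Tune alpha/beta against the dev transcripts and the scorer you built,
    # not against the final test set
    python lm_optimizer.py \
        --test_files /path/to/dev.csv \
        --scorer_path /path/to/kenlm.scorer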

You’ll need a good test setup for the end of training. This includes the scorer and the test set, but you can run that independently from training. It is usually all in one call just for simplicity.
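As a sketch (placeholder paths, assuming the 0.7.x flag names), such a test-only run looks roughly like this:

    # Evaluation only: no --train_files / --dev_files, just the test set,
    # a checkpoint to load, and the scorer; the tuned alpha/beta go in here
    # at test time (example values shown, use what lm_optimizer.py reports)
    python DeepSpeech.py \
        --test_files /path/to/test.csv \
        --checkpoint_dir /path/to/checkpoints \
        --scorer_path /path/to/kenlm.scorer \
        --lm_alpha 0.93 \
        --lm_beta 1.18 \
        --test_batch_size 32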

No, there are no lm_alpha and lm_beta values involved in creating a scorer; please read the posts on that.

Haven’t had that before, @lissyx any ideas? Or search the forum.

I see, so --scorer_path is purely for hooking in the scorer-based decoding at the end of training. The DeepSpeech.py script will test the model together with the scorer at the end of training?

How many “test” epochs will the scorer go through? Mine seemed to keep going after it reported some information, including the lm_alpha and lm_beta. Why would it keep going if it has finished on all of the test data? I ran it separately from training, though, so I am not sure whether it is first going through the test files from the model checkpoint or something like that.

Also, regarding the SWIG memory leak error: this only started occurring after I cloned the 0.7.3 branch. It was fine on the 0.7.4 branch (but 0.7.4 now lacks the generate_package.py script).

Yes, after the neural net is trained, testing uses both the output of the net and a beam search in the scorer to find the best result.

Just one; you should see the progress in the logs.

It doesn’t if you just give it the test set, not the train or dev sets.


Wanted to follow up on this topic, since the documentation for lm_optimizer.py is sparse and I want to make sure I understand it. The examples only mention providing a scorer and --test_files (which should actually be dev files to avoid biasing the results), as well as --n_trials.

Now this is fine and all, but the script also requires a model checkpoint file to run. That is, are we optimizing the scorer based on a specific model + scorer combination? I thought it was solely for the optimization of the scorer. Why is a model checkpoint needed as well?

It is optimizing the parameters for how to do the beam search in a scorer for a given model. Give it all the needed input and you’ll get the best parameters for querying your model with a given scorer. I haven’t seen the code, but as the checkpoints contain the trained net, it can query more information :slight_smile:
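So the full call ends up looking something like this (placeholder paths; --n_trials is the flag mentioned above, the rest are the shared training flags, so treat it as a sketch):

    # The checkpoint supplies the acoustic model the beam-search parameters
    # are tuned against; --n_trials limits the number of optimization trials
    python lm_optimizer.py \
        --test_files /path/to/dev.csv \
        --scorer_path /path/to/kenlm.scorer \
        --checkpoint_dir /path/to/checkpoints \
        --n_trials 100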

If you don’t have a checkpoint, try different values like 1/1, 1.5/0.5, 0.5/1.5, … with a test set. Depending on the use case, you don’t need very fine-grained values.

Excellent, I think this is clearer now. If I am making many models for comparison, it would be appropriate to optimize the LM for each model separately, correct?

Yesterday I noticed that changing my lm_alpha and lm_beta gave substantial improvements in my WER/CER results, so I want to make sure I am optimizing correctly.

Yes, you need different lm_alpha/lm_beta values for each model, as they significantly change how a path is selected in the beam search.
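For instance, a rough per-model sweep could look like this (placeholder paths and model names; each model is assumed to have its own checkpoint directory):

    # Tune alpha/beta separately for every model you want to compare
    for model in model_a model_b model_c; do
        python lm_optimizer.py \
            --test_files /path/to/dev.csv \
            --scorer_path /path/to/kenlm.scorer \
            --checkpoint_dir /path/to/$model/checkpoints \
            --n_trials 100 > lm_opt_$model.log
    done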

Search the web for CTC beam search to understand more about the process.


It’s just because it is re-using the training’s test code, which depends on a checkpoint.


Perfect. Now I understand.