English Subset Custom Scorer Optimization

Hi all,

I’m attempting to make a custom scorer for a specific domain. As I understand it, there’s no need to retrain the English model for my domain, etc.; I can just use a custom scorer built from the vocabulary I expect.

From the documentation, this essentially takes four steps:

  1. Run generate_lm to create the KenLM language model
  2. Use generate_scorer_package to package it into a DS scorer package (with any alpha and beta)
  3. Run lm_optimizer to find optimal values for alpha and beta.
  4. Run generate_scorer_package a second time with new alpha and beta for an optimized package.
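
To make steps 2 and 4 concrete, here is a rough sketch of how I understand the two packaging runs to differ: the command is the same and only the alpha and beta defaults change. The flag names are the ones I believe the docs use for generate_scorer_package; the file names (alphabet.txt, lm.binary, vocab.txt, the .scorer outputs) and the numeric values are placeholders I made up for illustration, not recommended settings.

```python
def scorer_package_cmd(alpha, beta, package="kenlm.scorer"):
    """Build the argument list for one generate_scorer_package run.

    File names are placeholders; point them at your own alphabet,
    KenLM binary, and vocabulary file.
    """
    return [
        "./generate_scorer_package",
        "--alphabet", "alphabet.txt",
        "--lm", "lm.binary",
        "--vocab", "vocab.txt",
        "--package", package,
        "--default_alpha", str(alpha),
        "--default_beta", str(beta),
    ]


# Step 2: first package, with arbitrary alpha and beta.
first_pass = scorer_package_cmd(1.0, 1.0)

# Step 4: repackage with whatever lm_optimizer reports
# (0.93 and 1.18 are made-up numbers standing in for that output).
second_pass = scorer_package_cmd(0.93, 1.18, package="kenlm_optimized.scorer")

print(" ".join(first_pass))
print(" ".join(second_pass))
```

Either list could be handed to subprocess.run once the paths point at real files.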

I’ve done steps 1 and 2 without issue, to the best of my knowledge, but how am I supposed to handle step 3 without a test_data.csv file? My understanding is that this file would come from the process of creating the English model. Is there a standard sample CSV I can use to optimize my scorer?

EDIT:
I’m guessing the best way to approach this initially is to use the CV Corpus. Is there any guidance on what is best to use here: my own dataset, or a particular test set from the CV Corpus?
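
In case I end up using my own recordings, this is a minimal sketch of how I would build the test CSV. It assumes the usual DeepSpeech CSV layout with the columns wav_filename, wav_filesize, and transcript; the helper name write_test_csv is my own, and lowercasing the transcripts is an assumption that only makes sense if the alphabet is lowercase English.

```python
import csv
import os


def write_test_csv(samples, csv_path):
    """Write a DeepSpeech-style CSV from (wav_path, transcript) pairs.

    Each WAV file must already exist, since its size in bytes is read
    from disk for the wav_filesize column.
    """
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav_path, transcript in samples:
            # Assumption: the alphabet is lowercase, so normalize case.
            text = transcript.strip().lower()
            writer.writerow([wav_path, os.path.getsize(wav_path), text])


# Hypothetical usage with placeholder paths:
# write_test_csv([("clips/sample_0001.wav", "turn on the pump")], "test_data.csv")
```

The resulting file would be what I pass to lm_optimizer as the test set.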