Hi all,
I’m attempting to make a custom scorer for a specific domain. As I understand it, there’s no need to retrain an English model for my domain, etc.; I can just use a custom scorer with the vocabulary I expect.
From the documentation, this essentially has four steps:

- Run `generate_lm` to create the KenLM language model and vocabulary.
- Use `generate_scorer_package` to package them into a DS scorer package (with any alpha and beta).
- Run `lm_optimizer` to find optimal values for alpha and beta.
- Run `generate_scorer_package` a second time with the new alpha and beta for an optimized package.
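For context, here is roughly what I ran for steps 1 and 2. Paths, filenames, and the starting alpha/beta are placeholders from my setup, and the exact flag names may differ between DeepSpeech versions, so treat this as a sketch:

```bash
# Step 1: build the KenLM model and vocabulary from my domain text (one sentence per line).
python3 generate_lm.py \
  --input_txt domain_corpus.txt \
  --output_dir . \
  --top_k 500000 \
  --kenlm_bins /path/to/kenlm/build/bin/ \
  --arpa_order 5 \
  --max_arpa_memory "85%" \
  --arpa_prune "0|0|1" \
  --binary_a_bits 255 \
  --binary_q_bits 8 \
  --binary_type trie

# Step 2: package lm.binary and the vocab into a scorer, with placeholder alpha/beta for now.
./generate_scorer_package \
  --alphabet alphabet.txt \
  --lm lm.binary \
  --vocab vocab-500000.txt \
  --package domain.scorer \
  --default_alpha 0.93 \
  --default_beta 1.18
```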
I’ve done steps 1 and 2 without issue, to the best of my knowledge, but how am I supposed to take care of step 3 without a `test_data.csv` file? My understanding is that this would come from the process of creating the English model. Is there a standard sample CSV I can use to optimize my scorer?
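If I’ve understood correctly, `lm_optimizer` wants a CSV in the same format the DeepSpeech importers produce for train/dev/test sets, and an invocation along these lines. The checkpoint path, scorer flag, and filenames here are my guesses, so the exact flags may differ by version:

```bash
# Expected CSV format (same as DeepSpeech train/dev/test sets), e.g.:
#   wav_filename,wav_filesize,transcript
#   clips/sample_0001.wav,123456,turn on the kitchen lights
python3 lm_optimizer.py \
  --test_files test.csv \
  --checkpoint_dir /path/to/deepspeech-0.9.3-checkpoint \
  --scorer_path domain.scorer
```

The alpha and beta it reports would then go back into `generate_scorer_package` for step 4.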
EDIT:
I’m guessing the best way to approach this initially is to use the CV Corpus. Is there any guidance on what is best to use here, whether that’s my own dataset or a particular test set from the CV Corpus?
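If the CV Corpus is the right call, I’m assuming I’d run a release through the Common Voice importer in the DeepSpeech repo to get a DeepSpeech-format `test.csv`, something like the sketch below (the importer path and flags are from memory, so they may be off):

```bash
# Rough sketch: convert an extracted Common Voice English release into
# DeepSpeech-format CSVs (needs SoX for the mp3 -> wav conversion).
python3 bin/import_cv2.py \
  --filter_alphabet data/alphabet.txt \
  /path/to/extracted/cv-corpus/en
# If this works the way I expect, it writes converted wavs plus train.csv,
# dev.csv and test.csv, and that test.csv could then be passed to
# lm_optimizer via --test_files.
```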