I’m attempting to build a custom scorer for a specific domain. As I understand it, there’s no need to retrain the English acoustic model for my domain, etc.; I can just use a custom scorer built from the vocabulary I expect.
From the documentation, this essentially has 4 steps:
1. `generate_lm` in order to create the KenLM language model
2. `generate_scorer_package` to package it into a DS scorer package (with any alpha and beta)
3. `lm_optimizer` to find optimal values for alpha and beta
4. `generate_scorer_package` a second time with the new alpha and beta for an optimized package
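For reference, the commands I’m running for these steps look roughly like the sketch below, based on the DeepSpeech scorer docs. The corpus file, alphabet path, checkpoint directory, and parameter values are my own placeholders, not canonical values:

```shell
# Step 1: build the KenLM model from my domain text corpus
python3 generate_lm.py \
  --input_txt my_domain_corpus.txt \
  --output_dir lm/ \
  --top_k 500000 \
  --kenlm_bins kenlm/build/bin/ \
  --arpa_order 5 \
  --max_arpa_memory "85%" \
  --arpa_prune "0|0|1" \
  --binary_a_bits 255 \
  --binary_q_bits 8 \
  --binary_type trie

# Step 2: package with placeholder alpha/beta values
./generate_scorer_package \
  --alphabet alphabet.txt \
  --lm lm/lm.binary \
  --vocab lm/vocab-500000.txt \
  --package my_domain.scorer \
  --default_alpha 0.93 \
  --default_beta 1.18

# Step 3: search for optimal alpha/beta -- this is the step
# that needs the test CSV I'm asking about
python3 lm_optimizer.py \
  --test_files test_data.csv \
  --checkpoint_dir deepspeech-checkpoint/

# Step 4: rerun generate_scorer_package with the alpha/beta
# values reported by lm_optimizer.py
```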
I’ve done steps 1 and 2 without issue, to the best of my knowledge, but how am I supposed to take care of step 3 without a test_data.csv file? My understanding is that this file would come from the process of training the English model. Is there a standard sample CSV I can use to optimize my scorer?
I’m guessing the best way to approach this initially is to use the CV Corpus. Is there any guidance on what is best to use here: my own dataset, or a particular test set from the CV Corpus?
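In case it helps clarify my question: my understanding from the DeepSpeech training docs is that the test CSV has three columns, `wav_filename`, `wav_filesize`, `transcript`. A minimal sketch of building one from my own clips would look like this (the helper name and file paths are mine, not from the docs):

```python
import csv
import os


def build_test_csv(clips, out_path):
    """Write a DeepSpeech-style test CSV.

    `clips` is a list of (wav_path, transcript) pairs; the columns
    (wav_filename, wav_filesize, transcript) follow the DeepSpeech
    training data CSV format.
    """
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav_path, transcript in clips:
            # wav_filesize is the size of the audio file in bytes
            writer.writerow([wav_path, os.path.getsize(wav_path), transcript])


if __name__ == "__main__":
    # Hypothetical usage with a single placeholder clip
    with open("sample.wav", "wb") as f:
        f.write(b"\x00" * 1024)  # stand-in bytes, not a real WAV
    build_test_csv([("sample.wav", "hello world")], "test_data.csv")
```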