We have a use case where people will be asked to say a particular phrase and then we want to measure how well they said it - so it's a pronunciation-scoring problem rather than a speech-recognition problem.
We have gotten surprisingly good results measuring pronunciation quality from the letter-by-letter confidences (especially when we use a custom alphabet)... but that only works when the sentence the decoder 'detects' is the actual sentence the speaker said.
So... the question is: is there a way I can 'force' the beam search to follow a particular path (the expected transcript) and measure the letter-by-letter confidences along the way, instead of letting it search for the best fit? Or, alternatively, could I 'swap in' a language model with a huge weighting on the expected phrase and get the same result that way?
This would need to happen at run time, in that a different 'target sentence' would be supplied on each occasion.
Has anyone tried to do something like this?
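To clarify what I mean by 'forcing' the path: the standard way to do this is CTC forced alignment - a Viterbi pass over the model's per-frame log-probabilities that is constrained to the expected transcript, so you get a frame assignment (and hence a confidence) for every letter even when free decoding would have picked a different sentence. Here's a rough self-contained sketch of the idea in plain numpy; the shapes, the blank index, and the "mean frame log-prob per letter" confidence measure are all my assumptions, not anything from a particular toolkit:

```python
import numpy as np

def ctc_forced_align(log_probs, target_ids, blank=0):
    """Viterbi-align a fixed target transcript against CTC frame outputs.

    log_probs:  (T, V) array of per-frame log-probabilities from the
                acoustic model (T frames, V symbols including blank).
    target_ids: label indices for the expected phrase, in order.
    Returns one confidence per target label: the mean frame log-prob
    over the frames the best constrained path assigns to that label.
    """
    T = log_probs.shape[0]
    # Extended CTC state sequence: blank, l1, blank, l2, blank, ...
    ext = [blank]
    for c in target_ids:
        ext += [c, blank]
    S = len(ext)
    NEG = -1e30
    dp = np.full((T, S), NEG)          # best path score ending in state s at frame t
    bp = np.zeros((T, S), dtype=int)   # backpointers for recovering the path
    dp[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        dp[0, 1] = log_probs[0, ext[1]]
        bp[0, 1] = 1
    for t in range(1, T):
        for s in range(S):
            cands, idx = [dp[t - 1, s]], [s]          # stay in the same state
            if s > 0:
                cands.append(dp[t - 1, s - 1]); idx.append(s - 1)  # advance one
            # skip over a blank when two consecutive labels differ
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(dp[t - 1, s - 2]); idx.append(s - 2)
            k = int(np.argmax(cands))
            dp[t, s] = cands[k] + log_probs[t, ext[s]]
            bp[t, s] = idx[k]
    # The constrained path may end on the final label or the trailing blank.
    s = S - 1 if dp[T - 1, S - 1] >= dp[T - 1, S - 2] else S - 2
    path = [s]
    for t in range(T - 1, 0, -1):
        s = bp[t, s]
        path.append(s)
    path.reverse()
    # Collect per-letter confidences from the frames assigned to each label.
    conf = {}
    for t, s in enumerate(path):
        if ext[s] != blank:
            li = (s - 1) // 2  # index back into target_ids
            conf.setdefault(li, []).append(log_probs[t, ext[s]])
    return [float(np.mean(v)) for _, v in sorted(conf.items())]
```

The point is that the dynamic program only ever visits states from the expected transcript, so you always get a confidence for every letter of the target phrase - a badly pronounced letter just comes back with a low score rather than getting replaced by whatever the free beam search would have preferred. Since the target is only an argument to the alignment, supplying a different sentence on each call at run time is trivial.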