Forcing language model to detect specific phrases?

We have a use case where people will be asked to say a particular phrase and then we want to measure how well they said it - so it’s a measurement of pronunciation rather than a language recognition problem.

We have gotten surprisingly good results looking at letter-by-letter confidences as a measure of pronunciation quality (especially when we use a custom alphabet)… but that only works when the sentence it ‘detects’ is the actual sentence the speaker said.

So… the question is, is there a way I can basically ‘force’ the beam search to find a particular path, and measure the letter by letter confidences along the way, instead of looking for the best fit? Or, alternatively, could I somehow ‘swap in’ a language model with a huge weighting on the expected phrase and get the results that way?

This would need to somehow happen at run time - in that the ‘target sentence’ would need to be supplied on each occasion.

Has anyone tried to do something like this?

Not to our knowledge

You could live-build the language model on each sentence. Since your dataset will be very small, it should not take long or require a huge amount of memory. However, it’s going to be quite invasive and it is going to add latency. So I guess it depends on how much this impacts your workflow.

I don’t know if just building a normal language model for a sentence would work, as it can still generate outputs with unknown probability in the LM, and there’s also the possibility of repeated n-grams within a single sentence.

Perhaps the easiest way would be to build a custom trie structure that only has the exact sequence of characters in your target sentence, and then use just the trie, without a language model. It would require a few changes to the decoder code where it assumes that LM and trie always are present together (or not present), but probably nothing major.
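To make the idea concrete, here’s a toy sketch (in Python rather than the decoder’s C++, and purely illustrative - `SingleSentenceTrie` and `allowed_next_chars` are made-up names, not part of the DeepSpeech codebase) of a structure that only admits the exact character sequence of the target sentence:

```python
# Toy sketch: a "trie" holding a single sentence. A decoder consulting it
# could only extend a beam with the one character that continues the target.
class SingleSentenceTrie:
    def __init__(self, sentence):
        self.sentence = sentence

    def allowed_next_chars(self, prefix):
        """Characters the decoder may emit after `prefix` (CTC blank aside)."""
        if not self.sentence.startswith(prefix):
            return set()   # prefix has diverged from the target: dead end
        if len(prefix) == len(self.sentence):
            return set()   # sentence is complete
        return {self.sentence[len(prefix)]}

trie = SingleSentenceTrie("ka arohia")
print(trie.allowed_next_chars("ka a"))  # {'r'}
print(trie.allowed_next_chars("ka x"))  # set()
```

The real decoder would still need acoustic scores to pick *when* each character is emitted; the trie only prunes *which* characters are legal next.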

Thanks heaps @reuben and @lissyx for this. I will see how far I get with trying to build a simple-ish language model at ‘run time’. In our use case that will work, because if we cache the language model by a hash of the target sentence, most of them will get used multiple times. Also, it’s probably a good thing that if the user says something really, wildly different from the target sentence, it won’t attempt to score it as if it were the target sentence pronounced badly. What we do want is to make it likely to squeeze the output into the target sentence if the ‘student’ says something vaguely resembling it.
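For the caching part, here’s a minimal sketch of what I have in mind (`scorer_path_for` and `build_fn` are made-up names; the actual scorer-building step - KenLM plus package generation - is just a placeholder callback here):

```python
import hashlib
from pathlib import Path

CACHE_DIR = Path("lm_cache")

def scorer_path_for(sentence, build_fn):
    """Return a cached scorer path for `sentence`, building it on first use.

    `build_fn(sentence, out_path)` stands in for whatever actually builds
    the per-sentence language model / scorer package.
    """
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(sentence.encode("utf-8")).hexdigest()
    out = CACHE_DIR / f"{key}.scorer"
    if not out.exists():
        build_fn(sentence, out)
    return out
```

Repeated requests for the same target sentence then hit the cache instead of rebuilding the model.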

I notice that all the language model/trie building stuff has changed in 0.7… I haven’t looked into it yet, but is there much chance I can grab an existing trie model in Python (aka hacking) and mess with the probabilities programmatically, to sort of ‘graft’ the target sentence on as the ‘by far and away most likely’ candidate text?

I’ll take the time to look into it next, but if you think it’s the right or wrong direction, it would be good to know…

Again, I’m not sure what the best way would be to handle the KenLM part, but for the trie, a good start would be to simply change the scorer.fill_dictionary(list(words)) in data/lm/ to scorer.fill_dictionary(["target sentence goes here"]).

Basically, instead of building a trie which is a vocabulary of words, you’re building a vocabulary of sentences, so to speak. In your case, it only has a single sentence.

@utunga Did you have much luck with this? I am interested in achieving a similar effect… I have installed KenLM and I’m trying to build my own language model, but haven’t succeeded yet… Could you let me know how you are getting on?

Hey @reuben thanks for the suggestion, thought I’d report back on progress… (took a while to upgrade all our stuff to 0.7, sorry).

So, yeah, just to report back: I created a custom model, per the suggestion, by making some small edits to override the dictionary and force a particular one, this way:

    words = set()
    for word in ["ka","arohia","katoatia","te","tāhi","mea","ƒakapono","te","hapū","o","ōtākou"]:
        words.add(word)

So here’s what happened. The actual sentence (as decoded with the standard language model) was:

ka arohia katoatia te hāhi me ōna ƒakapono e te hapū o ōtākou

I was trying to force it to produce this (nonsense) sentence

ka arohia katoatia te tāhi mea ƒakapono e te hapū o ōtākou

But what I actually got was:

ka arohia katoatia te ka mea ƒakapono o te hapū o ōtākou

As you can see, the attempt to force hāhi me ōna to become tāhi mea didn’t really work. I mean, it makes sense: ka was in the dictionary already, and maybe it’s acoustically closer to hāhi than tāhi is anyway, but yes, for whatever reason this approach isn’t gonna do what we need it to do, I guess…

So… hmm, I guess the next thing I could look at is providing a custom C++ implementation of Scorer? Something where you give it the full target sentence at run time?

@Paul_Raine very curious to find out more about what approach you used.

PS Just to clarify I hope I understood the suggestion correctly. Doing it exactly per your comment…

scorer.fill_dictionary(["ka arohia katoatia te tāhi mea ƒakapono te hapū o ōtākou"])

…results in…
TypeError: in method 'Scorer_fill_dictionary', argument 2 of type 'std::vector< std::string,std::allocator< std::string > > const &'

Isn’t that just because of str vs. bytes? Have you tried doing

scorer.fill_dictionary(["ka arohia katoatia te tāhi mea ƒakapono te hapū o ōtākou".encode("utf-8")])

Oh right, of course, it’s just one big ‘word’! I didn’t understand at all what you meant… I get it now!!

I tried it… it works! D’oh, of course.

Made a custom.scorer via:

scorer.fill_dictionary(["ka arohia katoatia te tāhi mea ƒakapono te hapū o ōtākou".encode("utf-8")]) 

And then, I guess it only has one word in the dictionary so that worked.

Well actually, to be 100% accurate, when I first tried it, it gave me this:

ka arohia katoatia te tā

Guess it really doesn’t wanna produce that sentence (which is why it’s quite a good test sentence)! But when I turned alpha all the way down to 0:

python3 -u DeepSpeech/ \
          --scorer_path 'custom.scorer' \
          --alphabet_config_path  '../data/lm/base_encoder/alphabet.txt' \
          --checkpoint_dir ../models/20200605_ds.0.7.1_thm/checkpoints \
          --summary_dir ../models/20200605_ds.0.7.1_thm/summaries  \
          --lm_alpha  0 \
          --lm_beta  1.85 \
          --one_shot_infer 'test.wav'

it gave me the target sentence back:

ka arohia katoatia te tāhi mea ƒakapono te hapū o ōtākou

Yay. So yup this totally works. Sorry I didn’t quite get it at first. :wink:



Hm, that makes sense, it’s something I didn’t think about. With a single phrase vocabulary like that, it’ll restrict the decoder to only output a prefix of the phrase, instead of forcing it to be output exactly, because a partial beam can have a higher score than the complete one.

I’m not sure if lm_alpha=0 is guaranteed to make this work every time, though because of the word insertion bonus from lm_beta it should tend towards the full sentence. Maybe setting lm_alpha=0 and also setting lm_beta sufficiently high is enough to cover all cases.
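To illustrate with made-up numbers why a prefix beam can out-score the full sentence, here’s a simplified sketch of the beam score used by this kind of CTC decoder (acoustic log-probability, plus alpha times the LM log-probability, plus a beta bonus per word; the function name and the specific numbers are invented for the example):

```python
def beam_score(log_p_acoustic, num_words, lm_alpha, lm_beta, log_p_lm=0.0):
    # Simplified CTC-decoder beam score: acoustic log-prob plus a weighted
    # LM log-prob plus a per-word insertion bonus.
    return log_p_acoustic + lm_alpha * log_p_lm + lm_beta * num_words

# Hypothetical case: the full 11-word sentence has a worse acoustic
# log-prob than a 4-word prefix of it.
prefix = beam_score(log_p_acoustic=-20.0, num_words=4, lm_alpha=0.0, lm_beta=1.85)
full = beam_score(log_p_acoustic=-34.0, num_words=11, lm_alpha=0.0, lm_beta=1.85)
print(prefix, full)  # with beta=1.85 the prefix still wins in this example

# A larger word-insertion bonus tips the balance toward the full sentence:
prefix_hi = beam_score(-20.0, 4, 0.0, 5.0)
full_hi = beam_score(-34.0, 11, 0.0, 5.0)
print(prefix_hi, full_hi)
```

So whether the complete sentence wins depends on how big the per-word bonus is relative to the acoustic penalty of the extra words.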

Anyway, at least you’ve made progress. Keep us updated on how the pronunciation help project is going, this is super interesting stuff!

Great that it works @utunga , but how do you measure how well the pronunciation is if you provide just the target sentence? Confidence value or with/without scorer?

Good question @othiele… to give a bit of needed context: we had made some other changes to expose the values of the logits in the output layer (i.e. the ‘confidences’ of the NN in that prediction).

By forcing the language model to pick specific letters in each position we are hoping to (more easily) get at the logit values for each letter in the target sentence. I would think of it as the confidence of the acoustic model in that prediction.
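For what it’s worth, here’s a rough sketch (not our actual code - `char_confidences` is a made-up name, and this uses a naive greedy alignment rather than a proper CTC forced alignment) of pulling per-character confidences out of the logits: softmax each frame, then read off the probability of each target character in order:

```python
import numpy as np

def char_confidences(logits, alphabet, transcript):
    # logits: (time_steps, alphabet_size) raw acoustic-model outputs.
    # alphabet: list of characters matching the logit columns (must include
    # the space character if the transcript contains word boundaries).
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)  # per-frame softmax
    confidences, t = [], 0
    for ch in transcript:
        idx = alphabet.index(ch)
        # greedily advance to the frame (at or after t) where this char peaks
        t = t + int(np.argmax(probs[t:, idx]))
        confidences.append(float(probs[t, idx]))
        t += 1
    return confidences
```

A low confidence on a given letter would then flag that part of the phrase as poorly pronounced.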

In the case of te reo Māori, the pronunciation can be predicted quite accurately from the letters (and, presumably, vice versa), so perfect pronunciation should result in the acoustic model getting things right on a letter-by-letter basis.

That said, we are also experimenting with training our NN using different ‘alphabets’ that map even more directly to the exact phonemes we are trying to model… and we will fine-tune the training of our acoustic model with audio only from people with particularly good pronunciation.
