Assigning weights to certain words while training DeepSpeech Model


I’ve been training a DeepSpeech model recently and was wondering whether there is any way to improve how well it handles named entities in the audio files, so that they are decoded correctly in the transcript.

One method I have in mind is to give higher weights to the named entities in the transcript when decoding. Is it possible to increase the weights of certain words when decoding with DeepSpeech? If so, would anyone be able to point me in the right direction for editing the source code?

If not, would there be a better method to improve the named entity recognition task using DeepSpeech by editing the source code in any other way?

Thank you.

I guess it would be easier to use a custom language model, as testing would be a lot faster and cheaper. Results might not be that different, depending on the use case.

@reuben should know how to change the weights, but that is definitely more complex.

I see, would customizing the language model with the KenLM library work?

Yes, search for custom language model in the forum. You can build a new language model without retraining the net.

You probably want many sentences with your Entities in many possible word combinations.
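One way to get that kind of coverage is to fill a handful of sentence templates with every entity and append the result to the text the language model is built from. This is only a minimal sketch: the templates, the entity list and the `expand_templates` helper are made up for illustration, and real templates should mirror how people actually speak in your recordings.

```python
from itertools import product


def expand_templates(templates, entities):
    """Fill the {entity} slot of every template with every entity."""
    return [t.format(entity=e) for t, e in product(templates, entities)]


# Hypothetical templates and entities, for illustration only.
templates = [
    "please call {entity} tomorrow",
    "i spoke to {entity} yesterday",
]
entities = ["ali", "siti"]

corpus_lines = expand_templates(templates, entities)

# One sentence per line is the usual input layout for language model text.
for line in corpus_lines:
    print(line)
```

The generated lines are meant to be added to the existing corpus rather than to replace it, so the language model still sees the natural sentences as well.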

Thank you, will try that out.

Hi @othiele, I’ve noticed that during training of DeepSpeech there is this --scorer flag with default path /data/lm/kenlm.scorer. Does this mean that I have to retrain the model with the new scorer file that I’ve generated?

Or can I simply use the model that I’ve trained previously without a scorer, and when transcribing new audio files simply change the scorer line to the path of my new scorer?

deepspeech --model deepspeech-0.7.3-models.pbmm --scorer deepspeech-0.7.3-models.scorer --audio my_audio_file.wav

Yes. The scorer does not affect the acoustic model training.

As Reuben said, leave the pbmm file alone and simply experiment with scorer files. The format has changed in the 0.7 branch from separate lm.binary and trie files to a combined scorer, but most of the KenLM workflow stayed the same. Read the KenLM docs - not easy - and follow this

Yep got it thanks, will try it out!! @reuben @othiele

And you can use the docs for further reference

@reuben, would it be possible to point me in the right direction as to how I should go about changing the weights assigned to particular characters/words?

That is what we’ve been doing. That is what a language model does. Assign weights to words.
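To make that concrete, here is a toy sketch of how a decoder can combine the acoustic score with a language model score. It is a simplification under stated assumptions, not the code in ctc_beam_search_decoder.cpp: the log probabilities, the candidate transcripts and the `combined_score` and `best_candidate` helpers are all invented for illustration, and `alpha` and `beta` only play the role of the language model weight and word insertion bonus.

```python
def combined_score(acoustic_logp, lm_logp, n_words, alpha, beta):
    """Toy score: acoustic evidence + weighted LM evidence + per-word bonus."""
    return acoustic_logp + alpha * lm_logp + beta * n_words


def best_candidate(candidates, alpha, beta):
    """Return the text of the candidate with the highest combined score."""
    best = max(
        candidates,
        key=lambda c: combined_score(
            c["acoustic_logp"], c["lm_logp"], len(c["text"].split()), alpha, beta
        ),
    )
    return best["text"]


# Made-up numbers: the acoustic model slightly prefers the wrong spelling,
# while the language model strongly prefers the real entity.
candidates = [
    {"text": "kuala lumpur", "acoustic_logp": -4.2, "lm_logp": -3.0},
    {"text": "koala lumper", "acoustic_logp": -4.0, "lm_logp": -9.0},
]

print(best_candidate(candidates, alpha=0.9, beta=1.2))  # LM switched on
print(best_candidate(candidates, alpha=0.0, beta=1.2))  # LM ignored
```

With the language model weighted in, the entity wins; with `alpha` set to zero the acoustically closer but wrong candidate wins. That is the sense in which a language model that has seen your entities raises their weight during decoding.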

@reuben Hmm ok but is it possible for us to modify the DeepSpeech Model such that it places more emphasis on the named entities?

That is, if I mark named entities with tags <>, can DeepSpeech recognise the words inside <> and place more emphasis on transcribing them accurately (by giving them higher weights), and less emphasis on the words outside <>?

I would be measuring how accurately the model decodes the named entities in <>, rather than the WER of the model.

It is possible, but I wouldn’t be able to point you in any direction more specific than just general literature on named entity recognition. Likely it would involve adapting the decoder, or writing a new one. You can take a look at ctc_beam_search_decoder.cpp to see how the current decoder is implemented.

Thank you for the advice, meaning to say I have to edit the ctc_beam_search_decoder.cpp file for my named entity recognition task?

@othiele I’ve followed your advice and increased the size of my custom language model (scorer file) by adding more sentences to it. These sentences were taken from corpora on the web, and they contain the named entities that are present in my training/test data.

After running inference on my test files, I found that it decodes the named entities even less accurately than the original scorer file, which was built only from the sentences in the training and dev sets of my dataset.

A possible reason for this, I think, is that:

  1. The distribution of the additional sentences added to the language model is different from that of the sentences in the training/test data.
    If so, is there any way for me to generate more sentences from my current training/dev dataset to add to the language model, so that the language model and the training data follow the same distribution?
    (Preferably one that does not require me to manually write the sentences myself.)

Thank you in advance!!

Please give us more info about your project: English, with accent, what training data, how many entities, examples and some numbers. “Poorly” doesn’t give us much to go on.

Ok so I’m training on a Malay dataset that I had from my company, not sure of the particular source but it can’t be found on the web. It consists of 15 min audio clips of people talking over the phone, with transcripts of named entities in capital letters.

I’ve split the audio clips and transcripts into short clips of around 1 sentence each, and trained it using DeepSpeech with 25 epochs.

After hitting a WER of about 0.59, I used this model to run inference on the test file. I then used the following metric to calculate the recall score for the NER task.

Recall = No. of NEs correctly decoded/Total no. of NEs in test transcripts
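For reference, the metric can be sketched roughly as below. This is a simplified illustration rather than my exact script: it assumes entities are the fully capitalised ASCII tokens in the reference transcript, counts each token of a multi-word entity separately, compares case-insensitively, and the `entity_recall` name and the example sentences are made up.

```python
import re
from collections import Counter

# Assumption: a named entity token is two or more capital ASCII letters.
ENTITY_RE = re.compile(r"\b[A-Z]{2,}\b")


def entity_recall(references, hypotheses):
    """Fraction of reference entity tokens that also appear in the hypothesis."""
    total = 0
    hits = 0
    for ref, hyp in zip(references, hypotheses):
        ref_entities = Counter(e.lower() for e in ENTITY_RE.findall(ref))
        hyp_words = Counter(hyp.lower().split())
        total += sum(ref_entities.values())
        # Count each entity at most as often as it occurs in the hypothesis.
        hits += sum(min(n, hyp_words[e]) for e, n in ref_entities.items())
    return hits / total if total else 0.0


# Made-up example: three entity tokens, two of them decoded correctly.
references = ["saya tinggal di KUALA LUMPUR", "jumpa AHMAD esok"]
hypotheses = ["saya tinggal di kuala lumpur", "jumpa ahmat esok"]

recall = entity_recall(references, hypotheses)
print(round(recall, 3))
```

Because it only checks whether the entity appears somewhere in the hypothesis, this version ignores word order and position.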

The recall score I got for the scorer with only sentences from the training set was around 0.18, which is pretty low. After adding in an additional 1.5 million sentences to the scorer file from the Malay dataset that I got from the following sources:

The recall score fell to 0.15.

This is what I did, and the results look pretty strange to me.

Thanks, how much training material do you have in total in hours and are the chunks you created from the longer inputs really exact?

Here’s a breakdown of the dataset I used to train my model.

TRAIN: 156 h 59 min, 59940 audio files
DEV: 20 h 14 min, 7814 audio files
TEST: 19 h 02 min, 7293 audio files

--train_files /home/stieon/new_malay_dataset/converted_train_audio/train.csv
--dev_files /home/stieon/new_malay_dataset/converted_dev_audio/dev.csv
--test_files /home/stieon/new_malay_dataset/converted_test_audio/test.csv
--epochs 50
--checkpoint_dir /home/stieon/checkpoint2/
--export_dir /home/stieon/model_dir_v2_with_scorer/
--train_batch_size 1
--dev_batch_size 1
--test_batch_size 1
--learning_rate 0.00005
--train_cudnn true
--log_dir /home/stieon/log
--dropout_rate 0.1
--early_stop true
--es_epochs 10
--report_count 7293
--scorer /home/stieon/new_malay_dataset/expanded_malay.scorer &

Early stopping got triggered at around epoch 25.

The chunks I created from the longer inputs are exact as I have manually checked through them and they correspond to the transcript.

Could it be that I have hit the maximum capabilities of my dataset?