Assigning weights to certain words while training DeepSpeech Model

othiele · June 16, 2020, 11:12am

I guess it would be easier to take a custom language model as testing would be a lot faster and cheaper. Results might not be that different, depending on the use case.

@reuben should know how to change the weights, but that is definitely more complex.

tieonster · June 17, 2020, 2:23am

I see. Would customizing the language model with the KenLM library work?

othiele · June 17, 2020, 7:18am

Yes, search for custom language model in the forum. You can build a new language model without retraining the net.

You probably want many sentences with your Entities in many possible word combinations.
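One way to get entities into many word combinations is to fill a handful of carrier phrases with every entity. The following is only a minimal sketch: the templates, the entity list, and the `expand_templates` helper are hypothetical placeholders, not part of DeepSpeech or KenLM, and real carrier phrases should come from in-domain text.

```python
from itertools import product


def expand_templates(templates, entities):
    """Fill every template with every entity, one sentence per pair."""
    sentences = []
    for template, entity in product(templates, entities):
        sentences.append(template.format(entity=entity))
    return sentences


# Hypothetical carrier phrases and entities; replace with in-domain text.
templates = [
    "please call {entity} tomorrow",
    "i spoke to {entity} yesterday",
]
entities = ["kuala lumpur", "petronas"]

corpus = expand_templates(templates, entities)
for line in corpus:
    print(line)
```

The generated lines can be appended, one sentence per line, to the text file the language model is built from.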

tieonster · June 18, 2020, 10:50am

Thank you, will try that out.

tieonster · June 18, 2020, 5:58am

Hi @othiele, I’ve noticed that during training of DeepSpeech there is this --scorer flag with default path /data/lm/kenlm.scorer. Does this mean that I have to retrain the model with the new scorer file that I’ve generated?

Or can I simply use the model that I’ve trained previously without a scorer, and when transcribing new audio files simply change the scorer line to the path of my new scorer?

deepspeech --model deepspeech-0.7.3-models.pbmm --scorer deepspeech-0.7.3-models.scorer --audio my_audio_file.wav

reuben · June 18, 2020, 8:01am

Yes, you can keep the model you already trained and simply point to the new scorer when transcribing. The scorer does not affect the acoustic model training.
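Because only the scorer changes, comparing several scorers against the same exported model is a matter of varying one argument. A rough sketch that builds the command line quoted above for each scorer; `custom.scorer` and the `build_command` helper are hypothetical names used for illustration:

```python
import shlex


def build_command(model, scorer, audio):
    """Return the argument list for one call to the deepspeech CLI."""
    return ["deepspeech", "--model", model, "--scorer", scorer, "--audio", audio]


model = "deepspeech-0.7.3-models.pbmm"
audio = "my_audio_file.wav"
# The second file name is a hypothetical custom scorer.
scorers = ["deepspeech-0.7.3-models.scorer", "custom.scorer"]

commands = [build_command(model, scorer, audio) for scorer in scorers]
for cmd in commands:
    # Print a shell-safe version of each command instead of running it.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Each list can be passed to `subprocess.run` to execute it, assuming the `deepspeech` client is installed and on the PATH.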

othiele · June 18, 2020, 8:20am

As Reuben said, leave the pbmm file alone and simply experiment with scorer files. The format has changed in the 0.7 branch from lm.binary and trie to a single combined scorer, but most of the KenLM workflow stayed the same. Read its docs (they are not easy) and follow this

tieonster · June 18, 2020, 8:26am

Yep got it thanks, will try it out!! @reuben @othiele

othiele · June 18, 2020, 8:36am

And you can use the docs for further reference

https://deepspeech.readthedocs.io/en/v0.7.3/Scorer.html#external-scorer-scripts

tieonster · June 18, 2020, 10:50am

@reuben, would it be possible to point me in the right direction as to how I should go about changing the weights assigned to particular characters/words?

reuben · June 18, 2020, 11:44am

That is what we’ve been doing. That is what a language model does. Assign weights to words.
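To illustrate the idea with a toy example: in a count-based language model, a word's weight comes from how often it appears in the text the model is built from. The sketch below is a plain maximum-likelihood unigram model, which is an assumption made for brevity and far simpler than the smoothed n-gram models KenLM actually builds; the sentences are made up.

```python
import math
from collections import Counter


def unigram_logprobs(sentences):
    """Map each word to log10 of its relative frequency in the corpus."""
    counts = Counter(word for s in sentences for word in s.split())
    total = sum(counts.values())
    return {word: math.log10(count / total) for word, count in counts.items()}


# Made-up corpora: the second one adds sentences containing an entity.
base = ["call the office", "call me later"]
boosted = base + ["call petronas now", "petronas called again"]

lp_base = unigram_logprobs(base)
lp_boosted = unigram_logprobs(boosted)

print("petronas" in lp_base)      # the entity is unseen in the base corpus
print(lp_boosted["petronas"])     # it gets a finite log-probability once added
```

Adding sentences that contain an entity raises its weight, while adding large amounts of unrelated text lowers the relative weight of everything else.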

tieonster · June 18, 2020, 12:41pm

@reuben Hmm ok but is it possible for us to modify the DeepSpeech Model such that it places more emphasis on the named entities?

That is, if I mark named entities with <> tags, could DeepSpeech recognise the words inside <> and place more emphasis on transcribing them accurately (by giving them higher weights), and less emphasis on the words outside <>?

I would be measuring how accurately the model decodes the named entities in <>, rather than the WER of the model.

reuben · June 18, 2020, 1:11pm

It is possible, but I wouldn’t be able to point you in any direction more specific than just general literature on named entity recognition. Likely it would involve adapting the decoder, or writing a new one. You can take a look at ctc_beam_search_decoder.cpp to see how the current decoder is implemented.

tieonster · June 18, 2020, 1:21pm

Thank you for the advice. Does that mean I have to edit the ctc_beam_search_decoder.cpp file for my named entity recognition task?

tieonster · June 24, 2020, 4:03am

@othiele I’ve followed your advice and increased the size of my custom language model (scorer file) by adding more sentences to it. These sentences were taken from corpora on the web, and they contain the named entities present in my training/test data.

After running inference on my test files, I found that the new scorer decodes the named entities even less accurately than the original scorer file, which was built only from sentences in the training and dev sets of my dataset.

One possible reason, I think, is this:

The distribution of the additional sentences added to the language model differs from that of the sentences in the training/test data.
If so, is there any way to generate more sentences from my current training/dev dataset and add them to the language model, so that the language model and the training data follow the same distribution?
(Preferably one that does not require me to write the additional sentences manually.)

Thank you in advance!!

othiele · June 24, 2020, 7:19am

Please give us more info about your project: English, with accent, what training data, how many entities, examples and some numbers. “Poorly” doesn’t give us much to go on.

tieonster · June 24, 2020, 9:09am

Ok, so I’m training on a Malay dataset that I got from my company. I’m not sure of the particular source, but it can’t be found on the web. It consists of 15 min audio clips of people talking over the phone, with transcripts in which the named entities are written in capital letters.

I’ve split the audio clips and transcripts into short clips of around 1 sentence each, and trained it using DeepSpeech with 25 epochs.

After hitting a WER of about 0.59, I used this model to run inference on the test file. I then used the following metric to calculate the recall score for the NER task.

Recall = No. of NEs correctly decoded/Total no. of NEs in test transcripts
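For reference, the metric can be sketched as below. This assumes that named entities are the all-caps tokens in the reference transcripts and that an entity counts as correctly decoded when the same word appears (ignoring case) in the hypothesis for that clip. Both are assumptions about the setup, and the example sentences are made up.

```python
from collections import Counter


def entity_tokens(reference):
    """Assumption: named entities are the all-caps tokens of the reference."""
    return [tok for tok in reference.split() if tok.isupper()]


def ne_recall(references, hypotheses):
    """Recall = correctly decoded NEs / total NEs in the reference transcripts."""
    correct = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        hyp_counts = Counter(hyp.lower().split())
        for ent in entity_tokens(ref):
            total += 1
            key = ent.lower()
            if hyp_counts[key] > 0:
                hyp_counts[key] -= 1  # each hypothesis word can match only once
                correct += 1
    return correct / total if total else 0.0


# Made-up reference/hypothesis pairs for illustration.
refs = ["saya tinggal di JOHOR", "encik AHMAD dari PETRONAS"]
hyps = ["saya tinggal di johor", "encik ahmat dari petronas"]
print(ne_recall(refs, hyps))
```

Multi-word entities are scored word by word here; a stricter variant would require the whole entity span to match.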

The recall score I got with the scorer built only from training-set sentences was around 0.18, which is pretty low. After adding an additional 1.5 million sentences to the scorer file, taken from Malay datasets from the following sources:

The recall score fell to 0.15.

This is what I did, and the results look pretty strange to me.

othiele · June 24, 2020, 9:43am

Thanks, how much training material do you have in total in hours and are the chunks you created from the longer inputs really exact?

tieonster · June 24, 2020, 9:49am

Here’s a breakdown of the dataset I used to train my model.

TRAIN: 156 h 59 min, 59940 audio files
DEV: 20 h 14 min, 7814 audio files
TEST: 19 h 02 min, 7293 audio files

DeepSpeech.py \
--train_files /home/stieon/new_malay_dataset/converted_train_audio/train.csv \
--dev_files /home/stieon/new_malay_dataset/converted_dev_audio/dev.csv \
--test_files /home/stieon/new_malay_dataset/converted_test_audio/test.csv \
--epochs 50 \
--checkpoint_dir /home/stieon/checkpoint2/ \
--export_dir /home/stieon/model_dir_v2_with_scorer/ \
--train_batch_size 1 \
--dev_batch_size 1 \
--test_batch_size 1 \
--automatic_mixed_precision \
--learning_rate 0.00005 \
--train_cudnn true \
--log_dir /home/stieon/log \
--dropout_rate 0.1 \
--early_stop true \
--es_epochs 10 \
--report_count 7293 \
--scorer /home/stieon/new_malay_dataset/expanded_malay.scorer &

Early stopping got triggered at around epoch 25.

The chunks I created from the longer inputs are exact: I have manually checked them, and they correspond to the transcripts.

Could it be that I have hit the maximum capabilities of my dataset?

othiele · June 24, 2020, 10:17am

Hm, you should be able to get a bit more out of it, especially if the entities occur more often.

Try a higher batch size if your GPU can handle it: 4, 8 or higher

Use a learning rate of 0.0001

Definitely use a higher dropout of 0.25-0.4