Trained model performs poorly on new sentences but very well on trained sentences

Hi all,

I have trained a Chinese model with version v0.7.4 on 33 hours of MP3 audio from Common Voice. Recognition of the trained sentences is very accurate, even when the sentence comes from a different source. For example,
source: “對外界提出的問題第一時間交代” (a.mp3)
result: “對外界提出的問題第一時間交代” (recognized from the microphone via WebSocket)

However, it performs poorly on new sentences and on combinations of the old sentences. For example:
source: “對外界提出的問題第一時間交代” (a.mp3)
source: “尊敬的陳冬副主任” (c.mp3)
I tried to combine parts of the above two sentences:

The sentence I want: “陳冬副主任對外界提出的問題”
The result: “陳姑吉列任對外界提出答左我” (recognized from the microphone via WebSocket)

I am sorry that I am unable to provide the training command and code. I can only provide the info below:
Language: Chinese
Accent: Hong Kong (Cantonese)
LM: generated by myself from 2,904 characters, 5-grams
MP3 source: Common Voice, 33 hours
Dropout rate: 0.22
Checkpoint: 222195
best_dev checkpoint: 222195
I expect my model to handle combinations (mixing words from two sentences) of trained sentences.
Q1. What is the problem with my model? Is the dataset not big enough? Any solution?

Q2. I will also train the model with new data. If the new data makes the loss larger, will the model ignore the new training and use the old checkpoint with the lower loss? I ask because the testing step always says “I Loading best validating checkpoint from ./path/best_dev-222195”, and I know that training runs with a larger loss are never saved as best_dev.

Thank you

You are most likely overfitting the model. Search for “overfitting”. Usually you need a couple hundred hours of audio to get a somewhat general model.
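Overfitting usually shows up as the training loss continuing to fall while the validation (dev) loss stalls or rises. Here is a minimal sketch, with made-up loss numbers, of how you could spot that turning point from the per-epoch losses a training run prints:

```python
# Hypothetical per-epoch losses (made-up numbers for illustration only):
# a model memorizing a small 33 h training set keeps pushing the training
# loss down while the dev loss eventually turns back up.
train_loss = [120.0, 60.0, 30.0, 15.0, 8.0, 4.0]
dev_loss = [125.0, 70.0, 45.0, 40.0, 42.0, 47.0]

def overfit_epoch(train, dev):
    """Return the first epoch where dev loss rises while train loss still falls."""
    for epoch in range(1, len(dev)):
        if dev[epoch] > dev[epoch - 1] and train[epoch] < train[epoch - 1]:
            return epoch
    return None

print(overfit_epoch(train_loss, dev_loss))  # → 4: dev loss starts rising there
```

From that epoch on, extra training only improves memorization of the training sentences, which matches the symptom described above.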

If you start training with a checkpoint-dir argument, it will search that directory for an old checkpoint. If you leave it out, you start a new training run from scratch. Use a different checkpoint dir to separate runs.
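On Q2: the best_dev checkpoint is only overwritten when the validation loss actually improves, so a later run with a larger dev loss leaves the old best_dev in place. A toy sketch of that selection logic (not DeepSpeech's actual code):

```python
# Toy model of best-checkpoint tracking (not DeepSpeech's real implementation):
# best_dev is only replaced when the new validation loss is lower.
best = {"step": None, "dev_loss": float("inf")}

def maybe_save_best(step, dev_loss, best=best):
    """Overwrite the tracked best_dev checkpoint only on improvement."""
    if dev_loss < best["dev_loss"]:
        best["step"] = step
        best["dev_loss"] = dev_loss  # would write best_dev-<step> to disk
    return best["step"]

maybe_save_best(222195, 40.0)  # first run: saved as best_dev-222195
maybe_save_best(250000, 55.0)  # worse dev loss: best_dev left untouched
print(best["step"])            # → 222195, the worse run never replaced it
```

That is why the test step keeps loading best_dev-222195 even after further training with higher loss.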

In addition, make sure the problem is not on the language model side.

I believe your language model is introducing a bias. The “new sentence” you tested with doesn’t sound at all like the result you got, yet the result is grammatically meaningful and correct; that points to a misuse of the LM.
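To make the bias concrete: a character n-gram LM built from only a few sentences has never seen the character transitions at the seam of a recombined sentence, so a strict n-gram score for the recombination collapses toward zero and the decoder is pushed back to trained sequences. A toy character-bigram sketch (not KenLM, and using only the two example sentences as the "corpus"):

```python
from collections import Counter

# Toy character-bigram LM trained on only the two source sentences.
corpus = ["對外界提出的問題第一時間交代", "尊敬的陳冬副主任"]
bigrams = Counter(s[i:i + 2] for s in corpus for i in range(len(s) - 1))

def seen_fraction(sentence, bigrams=bigrams):
    """Fraction of a sentence's character bigrams the LM has ever seen."""
    pairs = [sentence[i:i + 2] for i in range(len(sentence) - 1)]
    return sum(1 for p in pairs if bigrams[p] > 0) / len(pairs)

print(seen_fraction(corpus[0]))                   # → 1.0: fully covered
print(seen_fraction("陳冬副主任對外界提出的問題"))  # < 1.0: the seam "任對" is unseen
```

With a 5-gram LM over 2,904 characters the effect is far stronger, since a five-character window straddling the seam is almost certainly unseen.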

How did I misuse the LM?
Can I train the model without an LM?
But how can I run the client without an LM?

Hi @alansiu,
As far as I know, the language model, or to be more precise, the scorer, is only applied at inference time. So yes, you can train the model without an LM, and you can use the model without one as well.

I’m not sure which client you meant, but basically, just don’t specify an LM if you don’t want to use one.

On a side note, you said you trained the model with only 33 hours of audio; I believe that isn’t adequate. Overfitting could also give you results like this.

