Hi,
I’m training a model for phoneme recognition (so the output is phone sequences instead of word transcriptions). I trained two models, one for English and one for French (both without an LM). The Phone Error Rate is ~45% for English and ~75% for French. I would expect mediocre performance without an LM, but this is definitely problematic… I checked the posteriors, and the majority of predictions fall on the same phoneme (= it almost always predicts the last symbol in the alphabet, although it does predict something else occasionally). [edit: the probability is always >0.9, so it isn’t ambiguity that’s causing the issue.]
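For reference, this is roughly how I checked (a minimal NumPy sketch of my own debugging step; the file name `posteriors.npy` and the [n_frames, alphabet_size] shape are assumptions about my dump format, not anything from the repo):

```python
import numpy as np

# Hypothetical per-frame posteriors for one utterance, already softmaxed,
# shape [n_frames, alphabet_size]. Path and shape are my own setup.
probs = np.load("posteriors.npy")

last_id = probs.shape[1] - 1               # last symbol in the alphabet
frame_argmax = probs.argmax(axis=1)

print("fraction of frames whose argmax is the last symbol:",
      (frame_argmax == last_id).mean())
print("median peak probability per frame:",
      float(np.median(probs.max(axis=1))))
```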
Dataset:
- English: LibriSpeech-clean (train: 360 hrs / dev: 36 hrs / test: 36 hrs)
- French: Common Voice, cleaned (train: 200 hrs / dev: 62 hrs / test: 50 hrs)

The “transcriptions” are strings of phones without word boundaries.
Hyperparameters (same for both models):
--dev_batch_size 64
--train_batch_size 64
--test_batch_size 64
--n_hidden 1024
--epochs 200
--learning_rate 0.0001
--dropout_rate 0.40
--lm_alpha 0.0000000001
--lm_beta 0.0000000001
--plateau_reduction 0.1
--plateau_epochs 8
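(For clarity, this is how I read the two plateau flags; a paraphrase in Python assuming a standard reduce-on-plateau schedule, not the repo’s actual code:)

```python
# My reading of the plateau flags (an assumption, not the repo's code):
# after plateau_epochs epochs with no dev-loss improvement,
# scale the learning rate by plateau_reduction.
def maybe_reduce_lr(lr, epochs_without_improvement,
                    plateau_epochs=8, plateau_reduction=0.1):
    if epochs_without_improvement >= plateau_epochs:
        return lr * plateau_reduction
    return lr
```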
I tried a smaller learning rate, smaller dropout, and a larger batch size, but none of them changed the result much. So I’m just wondering whether, from your experience, you have any thoughts on what the issue might be.
Clarification: I’m reducing the LM weights (lm_alpha/lm_beta) to very small values, and I simply used the LM that comes with the repo, as I only care about the phone posteriors from the acoustic model. The lack of an LM is indeed one source of the bad performance. However, I think the acoustic model itself has an issue, as it tends to classify most phones as the same phone.
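For what it’s worth, I also looked at a plain greedy CTC decode of the posteriors, which bypasses lm_alpha/lm_beta entirely (a minimal sketch; `probs` and `phones` are hypothetical inputs from my own dump, and the blank is assumed to be the last index, as usual for CTC):

```python
import numpy as np

def greedy_ctc_decode(probs, phones):
    """Per-frame argmax, collapse repeats, drop blanks.

    probs:  [n_frames, len(phones) + 1] posteriors; blank is the last index
    phones: phone symbols in alphabet order
    """
    blank_id = len(phones)
    best = probs.argmax(axis=1)
    return [phones[p] for i, p in enumerate(best)
            if p != blank_id and (i == 0 or p != best[i - 1])]
```

With alpha and beta effectively at zero, the beam search should behave much like this, so it’s mainly a sanity check on the acoustic model itself.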
Thank you in advance!