WER and Loss increase when adding more data

I trained a model on about 140 hours of data (70999 files) with the following configuration:

--train_batch_size 80
--dev_batch_size 80
--test_batch_size 40
--n_hidden 375
--epoch 1000
--validation_step 1
--early_stop True
--earlystop_nsteps 6
--estop_mean_thresh 0.1
--estop_std_thresh 0.1
--dropout_rate 0.22
--learning_rate 0.00095
--report_count 100
--use_seq_length False
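
For reference, the full command looks roughly like this (the CSV paths and checkpoint directory are placeholders for my actual data):

```bash
# Training invocation for the ~140 h run; file paths are placeholders.
python -u DeepSpeech.py \
  --train_files data/train.csv \
  --dev_files data/dev.csv \
  --test_files data/test.csv \
  --checkpoint_dir checkpoints/140h \
  --train_batch_size 80 \
  --dev_batch_size 80 \
  --test_batch_size 40 \
  --n_hidden 375 \
  --epoch 1000 \
  --validation_step 1 \
  --early_stop True \
  --earlystop_nsteps 6 \
  --estop_mean_thresh 0.1 \
  --estop_std_thresh 0.1 \
  --dropout_rate 0.22 \
  --learning_rate 0.00095 \
  --report_count 100 \
  --use_seq_length False
```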

I trained multiple models, adding the data sequentially, and the WER and loss kept dropping. But when I added more data and trained a model on 230 hours (162638 files) with the same configuration, the loss and WER started increasing. Is that data related? Should I change the configuration? What kind of tests should I run?

Unfortunately the answer is “It depends”.

However, my guess is that you may be overfitting with 140 hours and are not overfitting with 230 hours.

A few follow-on questions:

  • Are the extra 90 hours drawn from the same distribution as the initial 140 hours?
  • How many epochs do you train on the extra 90 hours before “giving up”?

They are not from the same distribution.
If I train on the new data from scratch, it takes about 34 epochs.
If I train starting from the last best model trained on less data (starting from the frozen model), it takes about 15 epochs. In both cases early stopping is triggered.

When the new data is not from the same distribution, all bets are off: the WER can go down, go up, or stay the same.

For example, say the first 140 hours were recorded in a recording studio with basically no noise, but the additional 90 hours were recorded on a cheap microphone in a train station. One would then expect the WER to increase when adding the new data.
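
If you want a rough, quantitative check on that, comparing simple audio statistics between the two corpora is often enough to spot a mismatch. A minimal sketch using sox (the directory names are placeholders for your two datasets):

```bash
# Compare the average RMS level of the old and new corpora (requires sox).
# A large gap in level or noise floor suggests different recording conditions.
# The two directory names below are placeholders for your datasets.
for dir in studio_140h/wav station_90h/wav; do
  mean=$(for f in "$dir"/*.wav; do
           sox "$f" -n stats 2>&1 | awk '/^RMS lev dB/ {print $4}'
         done | awk '{sum += $1; n++} END {if (n) printf "%.1f", sum / n}')
  echo "$dir: mean RMS level ${mean} dB"
done
```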

Yeah, I suspected that too, thanks for the help. I'm trying to check each dataset on its own before adding it to the full data.
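
Concretely, I'm scoring the same checkpoint against each dataset's test CSV in separate runs, something like the sketch below. The paths are placeholders, and depending on the DeepSpeech version you may need a different way to skip the training phase (e.g. --notrain where a boolean train flag exists, or simply leaving --train_files unset in newer versions), so check your checkout:

```bash
# Evaluate one checkpoint against each dataset separately, so a noisy or
# mismatched dataset shows up as a WER outlier. Paths are placeholders.
for csv in datasets/studio/test.csv datasets/station/test.csv; do
  python -u DeepSpeech.py \
    --notrain \
    --checkpoint_dir checkpoints/230h \
    --n_hidden 375 \
    --test_batch_size 40 \
    --test_files "$csv"
done
```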