Low loss but high WER on test set

I trained DeepSpeech v4 on a portion of the data from http://openslr.org/53 (33 hours in total, which I know is nowhere near enough for a good model). Training produces reasonably good loss values on the validation and test sets, but the WER on the test set is very high.

The training parameters were:

N_hidden = 2048 (I wanted to overfit to get an overview)
Dropout = 0.2
Learning Rate = 0.0001
Epochs = 50 (early stopping triggered after 15 epochs)
Beam width = 1024
Alphabet Size = 62
Train/Dev/Test Ratio = 80/10/10
** All other parameters were left at their default values (full invocation sketched below)
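
For reference, the run was a plain DeepSpeech.py invocation, roughly like the sketch below; flag names follow the 0.4-era trainer and the CSV/alphabet paths are placeholders, so treat this as a sketch rather than the exact command:

```python
# Rough sketch of the training run, wrapped in Python for convenience.
# Flag names per the 0.4-era DeepSpeech.py; all paths are placeholders.
import subprocess

subprocess.run([
    "python", "DeepSpeech.py",
    "--train_files", "data/train.csv",
    "--dev_files", "data/dev.csv",
    "--test_files", "data/test.csv",
    "--alphabet_config_path", "data/alphabet.txt",  # 62 symbols
    "--n_hidden", "2048",
    "--epoch", "50",                # spelled --epochs in later releases
    "--learning_rate", "0.0001",
    "--dropout_rate", "0.2",
    "--beam_width", "1024",
    "--checkpoint_dir", "checkpoints/",
], check=True)
```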

And model performance:

Train Loss = 13
Dev Loss = 24
Test Loss = 24
Test WER = 0.75
Test Edit Distance = 0.36
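
(For clarity on the metrics: WER is word-level edit distance normalized by the reference word count, while the edit distance above is character-level. A few substituted characters can corrupt many words, which is how a 0.36 character edit distance can coexist with a 0.75 WER. A minimal sketch of both computations on plain-text pairs:)

```python
def levenshtein(a, b):
    """Edit distance between two sequences (of words or characters)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    """Word error rate: word-level edit distance / reference word count."""
    r, h = ref.split(), hyp.split()
    return levenshtein(r, h) / len(r)

def cer(ref, hyp):
    """Character error rate: char-level edit distance / reference length."""
    return levenshtein(ref, hyp) / len(ref)

# Two wrong characters break two of four words:
print(wer("the cat sat down", "the cot sat dawn"))  # 0.5
print(cer("the cat sat down", "the cot sat dawn"))  # 0.125
```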

The dataset was split per speaker: each speaker's utterances were distributed 80/10/10 among the train, dev, and test sets (so every speaker appears in all three sets).
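
The split logic was roughly as follows; a minimal sketch assuming a list of (speaker, utterance) pairs, with the 80/10/10 cut applied inside each speaker:

```python
import random
from collections import defaultdict

def split_per_speaker(utterances, seed=42):
    """Distribute each speaker's utterances 80/10/10 over train/dev/test."""
    by_speaker = defaultdict(list)
    for speaker, utt in utterances:
        by_speaker[speaker].append(utt)

    rng = random.Random(seed)
    train, dev, test = [], [], []
    for utts in by_speaker.values():
        rng.shuffle(utts)
        n_train, n_dev = int(0.8 * len(utts)), int(0.1 * len(utts))
        train += utts[:n_train]
        dev += utts[n_train:n_train + n_dev]
        test += utts[n_train + n_dev:]
    return train, dev, test
```

Note that under this scheme every test speaker is also heard in training, so the high WER cannot be blamed on unfamiliar voices.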

Experiments on another dataset, http://openslr.org/37/, gave a much lower WER on the test set, even though it contains only 9 hours of speech. Using pitch/tempo/speed augmentation (sketched below) I was able to reach a minimum of 0.24 WER and 0.08 edit distance on the test set.
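
The augmentation was done offline with sox, along the lines of this sketch; it writes one pitch-, tempo-, and speed-perturbed copy per clip, and the perturbation ranges are placeholders, not the exact values I used:

```python
import random
import subprocess
from pathlib import Path

def augment(wav_dir, out_dir, seed=42):
    """Write pitch-, tempo-, and speed-perturbed copies of each wav via sox."""
    rng = random.Random(seed)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for wav in Path(wav_dir).glob("*.wav"):
        for name, effect in [
            ("pitch", ["pitch", str(rng.randint(-200, 200))]),    # cents
            ("tempo", ["tempo", f"{rng.uniform(0.9, 1.1):.2f}"]), # ratio
            ("speed", ["speed", f"{rng.uniform(0.9, 1.1):.2f}"]), # ratio
        ]:
            dst = out / f"{wav.stem}_{name}.wav"
            subprocess.run(["sox", str(wav), str(dst)] + effect, check=True)
```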

The parameters were all the same except N_hidden, which was 1024 for this experiment.

So, what might be causing the high WER on the test set? What can I investigate other than adding more data (training is prohibitively time-consuming on a larger dataset)? Thanks