I am training the model on “read aloud” data, and the test results on similar data are quite good (around 0.2 WER). The problem is that the model is meant to be used on spontaneous, more carelessly spoken language, where the transcription quality is significantly worse.
Would you have any ideas on how to modify the “read aloud” dataset to train the model for spoken language? Would something like speeding up the training audio files or adding some noise to them make any sense?
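For illustration, here is a minimal sketch of the two ideas mentioned above (speeding up the audio and adding noise). It assumes the audio is already loaded as a list of float samples in [-1, 1]. The function names `change_speed` and `add_noise` are hypothetical and not part of DeepSpeech. The speed change uses plain linear interpolation, so it also shifts the pitch.

```python
import math
import random


def change_speed(samples, factor):
    """Return samples played `factor` times faster (linear interpolation).

    This is a naive resampling: pitch rises together with speed.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    n_out = max(1, int(len(samples) / factor))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * factor
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        # Blend the two neighbouring input samples.
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


def add_noise(samples, snr_db, rng=None):
    """Add white Gaussian noise at the requested signal-to-noise ratio (dB)."""
    if not samples:
        return []
    rng = rng or random.Random()
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]
```

For example, `add_noise(change_speed(samples, 1.1), 20.0)` would give a copy that is about 10% faster with noise at roughly 20 dB SNR. Whether this actually closes the gap to conversational speech is something that would have to be checked on real target data.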
lissyx
What is this dataset? In general, you should share much more information: we have no idea how much data you have or how your training and inference are performed, and there are multiple factors that could explain poor results.
Altogether there are about 500 hours of training data, but unfortunately, of all the datasets, only the very small ‘Forscher’ dataset is spoken in a more natural manner, similar to actual conversations, so only this data corresponds closely to the actual use case.
Preprocessing before training:
Because the phone conversation data on which I am going to run inference has an 8 kHz sampling rate, I first downsample all the training data to 8 kHz with A-law encoding (to make it more similar to the target data), and then upsample it again to 16 kHz for training. Upsampling to 16 kHz is done in the same way as in client.py.
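To make the A-law step concrete, here is a rough sketch of what the companding round trip does to the samples. It is not the code from client.py and not a bit-exact G.711 codec: it uses the continuous A-law curve with A = 87.6 and a uniform quantizer of about 8 bits, and it assumes the resampling to 8 kHz and back to 16 kHz is done elsewhere (for example with sox). The function names are hypothetical.

```python
import math

A = 87.6                      # standard A-law compression parameter
_DENOM = 1.0 + math.log(A)


def alaw_compress(x):
    """Map a sample in [-1, 1] through the continuous A-law curve."""
    ax = abs(x)
    if ax < 1.0 / A:
        y = A * ax / _DENOM
    else:
        y = (1.0 + math.log(A * ax)) / _DENOM
    return math.copysign(y, x)


def alaw_expand(y):
    """Inverse of alaw_compress."""
    ay = abs(y)
    if ay < 1.0 / _DENOM:
        x = ay * _DENOM / A
    else:
        x = math.exp(ay * _DENOM - 1.0) / A
    return math.copysign(x, y)


def alaw_round_trip(pcm16):
    """Simulate the quality loss of A-law coding on 16-bit PCM samples."""
    out = []
    for s in pcm16:
        x = max(-1.0, min(1.0, s / 32768.0))
        code = round(alaw_compress(x) * 127)      # integer in -127..127
        restored = alaw_expand(code / 127.0)
        out.append(max(-32768, min(32767, round(restored * 32767))))
    return out
```

Applying `alaw_round_trip` to the 8 kHz samples before upsampling should give training audio with quantization noise similar in character to telephone audio, under the assumptions above.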
Inference:
Inference is done with the client.py file. I have adjusted the BEAM_WIDTH = 1024 parameter; I am not sure whether I need to change N_FEATURES and N_CONTEXT, though.
As mentioned earlier, the data I run inference on is phone data, so it has an 8 kHz sampling rate. I let client.py upsample this 8 kHz data to 16 kHz, because the training data is converted in the same way.
The inference results are not as good as the test results, and that is the current problem.
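To put a number on that gap, the same WER metric can be computed on a small set of hand-transcribed phone recordings. Below is a minimal sketch, assuming plain whitespace tokenization and no text normalization; the function name `wer` is just for illustration.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

With this definition, one wrong word in a five-word reference gives a WER of 0.2.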
Hello @Jendker,
500 h is not enough to start training from scratch; maybe try transfer learning? I’m getting my best results with transfer learning using 500 h of Spanish.