Thank you for your answer!
I am trying to train a model for German speech spoken over the phone.
At the moment I am using publicly available datasets:
Altogether this is about 500 hours of training data. Unfortunately, only the very small ‘Forscher’ dataset is spoken in a more natural manner, similar to actual conversations, so only this data corresponds closely to the actual use case.
Preprocessing before training:
Because the phone conversation data on which I will run inference has an 8 kHz sampling rate, I first downsample all the training data to 8 kHz with A-law encoding (to make it more similar to the target data), and then upsample it back to 16 kHz for training. Upsampling to 16 kHz is done in the same way as in
sox <input file> --bits 16 --channels 1 --rate 16000 --encoding signed-integer --endian little --compression 0.0 --no-dither <output_file>
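To make the two-step conversion concrete, here is a minimal Python sketch that only builds the two sox command lines. The upsampling flags are copied from the command above; the downsampling flags are my assumption of how an 8 kHz A-law step could be written, and the function name `build_sox_commands` and the file names are just placeholders.

```python
def build_sox_commands(src, tmp, dst):
    """Return two sox argument lists: 8 kHz A-law downsample, then 16 kHz upsample."""
    src, tmp, dst = str(src), str(tmp), str(dst)

    # Step 1 (assumed flags, not taken from the post): reduce the audio to
    # telephone quality. tmp should be a container that can hold A-law, e.g. .wav.
    downsample = [
        "sox", src,
        "--channels", "1", "--rate", "8000", "--encoding", "a-law",
        tmp,
    ]

    # Step 2: same output options as the sox command quoted above.
    upsample = [
        "sox", tmp,
        "--bits", "16", "--channels", "1", "--rate", "16000",
        "--encoding", "signed-integer", "--endian", "little",
        "--compression", "0.0", "--no-dither",
        dst,
    ]
    return downsample, upsample
```

With sox installed, each list could be passed to `subprocess.run(cmd, check=True)`. Building the commands in one place should make it easier to check that training and inference audio go through identical settings.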
The training parameters:
Inference is done with the client.py file. I have adjusted the
BEAM_WIDTH = 1024 parameter; I am not sure if I need to change
As mentioned earlier, the data on which I run inference is phone data, so it has an 8 kHz sampling rate. I let client.py upsample this 8 kHz data to 16 kHz, because the training data is converted in the same way.
The inference results are not as good as the test results, and that is the current problem.