Thank you for your answer!
Currently I am trying to train a model for German, spoken over the phone.
Datasets:
Currently I am using publicly available datasets:
Altogether this is about 500 hours of training data, but unfortunately only the very small ‘Forscher’ dataset is spoken in a more natural, conversational manner, and only this data corresponds closely to the actual use case.
Preprocessing before training:
Because the phone conversation data I am going to run inference on has an 8 kHz sampling rate, I first downsample all the training data to 8 kHz with A-law encoding (to make it more similar to the target data), and then upsample it again to 16 kHz for training. The upsampling to 16 kHz is done in the same way as in client.py:
sox <input file> --bits 16 --channels 1 --rate 16000 --encoding signed-integer --endian little --compression 0.0 --no-dither <output_file>
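Concretely, the per-file conversion looks roughly like this (file names are placeholders, and storing the A-law intermediate in a .wav container is my assumption here):

```bash
# 1) Simulate the telephone channel: mono, 8 kHz, A-law (file names are placeholders)
sox original.wav --rate 8000 --channels 1 --encoding a-law telephone_8k.wav

# 2) Convert back to 16 kHz signed 16-bit PCM for training, mirroring the client.py command
sox telephone_8k.wav --bits 16 --channels 1 --rate 16000 --encoding signed-integer \
    --endian little --compression 0.0 --no-dither train_16k.wav
```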
The training parameters (the full command is sketched after the list):
--lm_trie_path <trie_path>
--lm_binary_path <lm.binary_path>
--checkpoint_dir <checkpoints_path>
--export_dir <model_export_path>
--alphabet_config_path <alphabet.txt_path>
--train_files <train.csv_path>
--dev_files <dev.csv_path>
--test_files <test.csv_path>
--es_steps 5
--train_batch_size 24
--dev_batch_size 48
--test_batch_size 48
--n_hidden 2048
--learning_rate 0.0001
--dropout_rate 0.18
--epochs 50
--decoder_library_path native_client/libctc_decoder_with_kenlm.so
--summary_secs 600
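Put together, the training invocation looks roughly like this (assuming the standard DeepSpeech.py entry point run from the repository root; the <...> placeholders stand for real paths):

```bash
# Sketch of the full training command, assuming DeepSpeech.py is used as the entry point
python -u DeepSpeech.py \
  --train_files <train.csv_path> \
  --dev_files <dev.csv_path> \
  --test_files <test.csv_path> \
  --alphabet_config_path <alphabet.txt_path> \
  --lm_binary_path <lm.binary_path> \
  --lm_trie_path <trie_path> \
  --checkpoint_dir <checkpoints_path> \
  --export_dir <model_export_path> \
  --decoder_library_path native_client/libctc_decoder_with_kenlm.so \
  --n_hidden 2048 \
  --learning_rate 0.0001 \
  --dropout_rate 0.18 \
  --epochs 50 \
  --es_steps 5 \
  --train_batch_size 24 \
  --dev_batch_size 48 \
  --test_batch_size 48 \
  --summary_secs 600
```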
Inference:
Inference is done with the client.py file. I have adjusted the BEAM_WIDTH = 1024 parameter, but I am not sure whether I also need to change N_FEATURES and N_CONTEXT.
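A quick way to see which values the client currently defines (the path to client.py is an assumption and may differ between DeepSpeech versions):

```bash
# Assumption: the Python client lives at native_client/python/client.py in the
# DeepSpeech checkout; adjust the path if yours differs.
grep -nE "BEAM_WIDTH|N_FEATURES|N_CONTEXT" native_client/python/client.py
```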
As mentioned earlier, the data I run inference on is phone data, so it has an 8 kHz sampling rate. I let client.py handle the upsampling of this 8 kHz data to 16 kHz, because the training data was converted in the same way.
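For completeness, a minimal sketch of how the inference input can be sanity-checked, or alternatively upsampled offline with exactly the same sox command used for the training data (file names are placeholders):

```bash
# Check that the phone recording really is 8 kHz mono before passing it to client.py
soxi phone_call.wav

# Alternative: upsample offline with the same sox invocation used for the
# training data, so the inference input matches the training conversion exactly
sox phone_call.wav --bits 16 --channels 1 --rate 16000 --encoding signed-integer \
    --endian little --compression 0.0 --no-dither phone_call_16k.wav
```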
The inference results are not as good as the test results, and that is the current problem.