Training model for fluently spoken language

Jendker · August 9, 2019, 3:30pm

Hello everyone,

I am training the model on “read aloud” data, and the test results on similar data are quite good (around 0.2 WER). The problem is that the model is meant to be used for spontaneous, more carelessly spoken language, and there the transcription quality is significantly worse.

Would you have any ideas on how to modify the “read aloud” dataset to train the model for spoken language? Would something like speeding up the training audio files or adding some noise to them make any sense?
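
To make the two ideas in the question concrete, here is a rough Python sketch of speeding up a clip and mixing in noise. It assumes NumPy is available and that the audio is already loaded as a float array. The function names, the speed factor of 1.1 and the 15 dB signal-to-noise ratio are illustrative assumptions, not values taken from DeepSpeech or from any of the datasets.

```python
import numpy as np


def change_speed(samples: np.ndarray, factor: float) -> np.ndarray:
    """Return the clip resampled so that it plays `factor` times faster.

    Uses plain linear interpolation, so the pitch shifts together with
    the tempo (like playing a tape faster).
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_out = max(1, int(round(len(samples) / factor)))
    old_positions = np.arange(len(samples))
    new_positions = np.linspace(0, len(samples) - 1, num=n_out)
    return np.interp(new_positions, old_positions, samples)


def add_noise(samples: np.ndarray, snr_db: float,
              rng: np.random.Generator) -> np.ndarray:
    """Add white Gaussian noise at roughly the requested SNR in dB."""
    signal_power = np.mean(samples ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=samples.shape)
    return samples + noise


# Demo on a synthetic one-second 440 Hz tone sampled at 16 kHz.
rate = 16000
t = np.arange(rate) / rate
clean = 0.5 * np.sin(2 * np.pi * 440 * t)

faster = change_speed(clean, 1.1)                          # about 10% shorter
noisy = add_noise(clean, 15.0, np.random.default_rng(0))  # same length, noisier
```

White noise is only a stand-in here. Recorded background noise from real phone calls would probably match the target conditions better, if such recordings are available.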

lissyx · August 9, 2019, 10:02am

What is this dataset? In general, you should share much more information. We have no idea how much data you have or how your training and inference are performed, and there are multiple factors that could explain poor results.

Jendker · August 9, 2019, 12:29pm

Thank you for your answer!

Currently I am trying to train a model for German spoken over the phone.

Datasets:
Currently I am using publicly available datasets:

Common Voice
Voxforge (http://www.repository.voxforge1.org/downloads/de/Trunk/Audio/Main/16kHz_16bit/)
Zamia (http://goofy.zamia.org/zamia-speech/corpora/zamia_de/)
TUDA (http://ltdata1.informatik.uni-hamburg.de/kaldi_tuda_de/german-speechdata-package-v3.tar.gz)
SWC (https://www2.informatik.uni-hamburg.de/nats/pub/SWC/SWC_German.tar)
Forscher (http://goofy.zamia.org/zamia-speech/corpora/forschergeist/)
Librivox (http://www.caito.de/data/Training/stt_tts/de_DE.tgz)

Altogether this is about 500 hours of training data. Unfortunately, of all these only the very small ‘Forscher’ dataset is spoken in a more natural manner, similar to actual conversations, so only this data corresponds closely to the actual use case.

Preprocessing before training:
Because the phone conversation data on which I am going to run inference has an 8 kHz sampling rate, I first downsample all the training data to 8 kHz with A-law encoding (to make it more similar to the target data) and then upsample it again to 16 kHz for training. Upsampling to 16 kHz is done in the same way as in client.py:

sox <input_file> --bits 16 --channels 1 --rate 16000 --encoding signed-integer --endian little --compression 0.0 --no-dither <output_file>
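
As a sketch, the whole round trip (down to 8 kHz A-law, then back up to 16 kHz) could be scripted from Python like this. The upsampling command reuses the options from the sox call above. The options of the downsampling command are an assumption, since the exact call is not given here, and the helper names and file names are hypothetical.

```python
import subprocess


def build_sox_commands(src: str, tmp: str, dst: str) -> list:
    """Build the two sox invocations for one file, without running them."""
    # Step 1 (assumed options): mono, 8 kHz, A-law, to mimic phone audio.
    to_phone = [
        "sox", src,
        "--rate", "8000", "--channels", "1", "--encoding", "a-law",
        tmp,
    ]
    # Step 2: back to 16 kHz 16-bit signed PCM, same options as the
    # upsampling command shown above.
    to_16k = [
        "sox", tmp,
        "--bits", "16", "--channels", "1", "--rate", "16000",
        "--encoding", "signed-integer", "--endian", "little",
        "--compression", "0.0", "--no-dither",
        dst,
    ]
    return [to_phone, to_16k]


def convert(src: str, tmp: str, dst: str) -> None:
    """Run both steps; raises CalledProcessError if sox fails."""
    for cmd in build_sox_commands(src, tmp, dst):
        subprocess.run(cmd, check=True)


# Only build the commands here; convert() needs sox installed.
commands = build_sox_commands("clip.wav", "clip_8k_alaw.wav", "clip_16k.wav")
```

Passing the arguments as a list instead of a single shell string avoids quoting problems with file names that contain spaces.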

The training parameters:
--lm_trie_path <trie_path>
--lm_binary_path <lm.binary_path>
--checkpoint_dir <checkpoints_path>
--export_dir <model_export_path>
--alphabet_config_path <alphabet.txt_path>
--train_files <train.csv_path>
--dev_files <dev.csv_path>
--test_files <test.csv_path>
--es_steps 5
--train_batch_size 24
--dev_batch_size 48
--test_batch_size 48
--n_hidden 2048
--learning_rate 0.0001
--dropout_rate 0.18
--epochs 50
--decoder_library_path native_client/libctc_decoder_with_kenlm.so
--summary_secs 600

Inference:
Inference is done with the client.py file. I have adjusted the BEAM_WIDTH parameter (BEAM_WIDTH = 1024), but I am not sure whether I also need to change N_FEATURES and N_CONTEXT.
As mentioned earlier, the data on which I run inference is phone data with an 8 kHz sampling rate. I let client.py upsample this 8 kHz data to 16 kHz, because the training data is converted in the same way.

The inference results are not as good as the test results, and that is the current problem.
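
To put a number on that gap, WER can be computed by hand for a few manually transcribed phone recordings. The sketch below uses the standard definition (word-level edit distance divided by the number of reference words). The function name and the German example sentences are made up for illustration, and the way DeepSpeech aggregates WER over a whole test set may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("habe" -> "hab") and one deletion ("keine")
# against five reference words.
wer = word_error_rate("ich habe heute keine zeit", "ich hab heute zeit")
print(wer)  # 0.4
```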

carlfm01 · August 10, 2019, 2:29am

Hello @Jendker,
500h is not enough to train from scratch. Maybe try transfer learning? I’m getting my best results with transfer learning using 500h of Spanish.

You can start reading here:

Jendker · August 10, 2019, 10:12pm

Thanks for the link, it looks very promising! I’ll try that out soon.