Transfer learning with a new alphabet (new vocabulary) on Common Voice data | Training loss increases while an epoch is in progress

If you’ve found a bug, or have a feature request, then please create an issue with the following information:

  • Have I written custom code (as opposed to running examples on an unmodified clone of the repository) : NO
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04) : Ubuntu 18.04
  • TensorFlow installed from (our builds, or upstream TensorFlow) : Upstream tensorflow r1.15.3 (with GPU)
  • TensorFlow version (use command below) : tensorflow r1.15.3 (with GPU)
  • Python version : Python3
  • Bazel version (if compiling from source) : Bazel 0.26.1 (TensorFlow compiled from source)
  • GCC/Compiler version (if compiling from source) : gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
  • CUDA/cuDNN version : CUDA 10.2 / CUDNN v7.6.5
  • GPU model and memory : NVIDIA 1080 Ti (11 GB)
  • Exact command to reproduce :
python3 DeepSpeech.py \
    --n_hidden 2048 \
    --drop_source_layers 1 \
    --alphabet_config_path data/new_alphabet.txt \
    --save_checkpoint_dir /data/Self/test/DeepSpeech/train_3/ \
    --load_checkpoint_dir /data/Self/test/DeepSpeech/checkpoint/ \
    --train_files   data/clips/train.csv \
    --dev_files   data/clips/dev.csv \
    --test_files  data/clips/test.csv \
    --learning_rate 0.000005 \
    --use_allow_growth true \
    --train_cudnn \
    --epochs 20 \
    --export_dir /data/Self/test/DeepSpeech/train_3/ \
    --summary_dir /data/Self/test/DeepSpeech/train_3/summary \
    --train_batch_size 32 \
    --dev_batch_size 32 \
    --test_batch_size 32 \
    --export_batch_size 1 \
    --dropout_rate=0.30

I’m using DeepSpeech version 0.7.3.

I want to train a DeepSpeech model with my own domain-specific data, so I want to add new characters such as the digits 0-9, the double quote (") and the period (.) to the existing DeepSpeech alphabet given here.

As a first step before using my own data, I wanted to do transfer learning on the Common Voice data with the new alphabet, starting from the published checkpoint given here. The intention is to use the resulting checkpoint to start training on my own data (since it covers the new characters), rather than the published DeepSpeech checkpoint (since its alphabet/vocabulary is limited).

My new alphabet file is:

# Each line in this file represents the Unicode codepoint (UTF-8 encoded)
# associated with a numeric label.
# A line that starts with # is a comment. You can escape it with \# if you wish
# to use '#' as a label.
 
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
'
0
1
2
3
4
5
6
7
8
9
.
"
# The last (non-comment) line needs to end with a newline.
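
As a quick sanity check, something like this can verify that every character in the training transcripts is covered by the alphabet (a rough sketch with my paths; the transcript column name is the one the importer writes):

    import csv

    # Collect the labels from the alphabet file, skipping comment lines.
    with open("data/new_alphabet.txt", encoding="utf-8") as f:
        labels = {line.rstrip("\n") for line in f if not line.startswith("#")}

    # Flag any transcript characters that the alphabet does not cover.
    with open("data/clips/train.csv", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            missing = set(row["transcript"]) - labels
            if missing:
                print(f"Uncovered characters {sorted(missing)}: {row['transcript']}")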

When I train with the DeepSpeech.py command listed above under “Exact command to reproduce”, my training loss starts low at the beginning of each epoch and then increases while the epoch is in progress.

The output from the terminal is as follows:

Epoch 0 |   Training | Elapsed Time: 0:48:14 | Steps: 7186 | Loss: 51.725979                                                                                                                                                                                                  
Epoch 0 | Validation | Elapsed Time: 0:01:25 | Steps: 475 | Loss: 44.295110 | Dataset: /home/tumu/Self/Research/Work/tensorflow_work/models/try/rnnt-speech-recognition/data/clips/dev.csv                                                                                    
I Saved new best validating model with loss 44.295110 to: /data/Self/test/DeepSpeech/train_3/best_dev-739708
--------------------------------------------------------------------------------
Epoch 1 |   Training | Elapsed Time: 0:48:03 | Steps: 7186 | Loss: 33.749292                                                                                                                                                                                                  
Epoch 1 | Validation | Elapsed Time: 0:01:25 | Steps: 475 | Loss: 41.206871 | Dataset: /home/tumu/Self/Research/Work/tensorflow_work/models/try/rnnt-speech-recognition/data/clips/dev.csv                                                                                    
I Saved new best validating model with loss 41.206871 to: /data/Self/test/DeepSpeech/train_3/best_dev-746894
--------------------------------------------------------------------------------
Epoch 2 |   Training | Elapsed Time: 0:48:01 | Steps: 7186 | Loss: 31.213234                                                                                                                                                                                                  
Epoch 2 | Validation | Elapsed Time: 0:01:24 | Steps: 475 | Loss: 39.810643 | Dataset: /home/tumu/Self/Research/Work/tensorflow_work/models/try/rnnt-speech-recognition/data/clips/dev.csv                                                                                    
I Saved new best validating model with loss 39.810643 to: /data/Self/test/DeepSpeech/train_3/best_dev-754080
--------------------------------------------------------------------------------
Epoch 3 |   Training | Elapsed Time: 0:48:03 | Steps: 7186 | Loss: 29.791398                                                                                                                                                                                                  
Epoch 3 | Validation | Elapsed Time: 0:01:24 | Steps: 475 | Loss: 39.136365 | Dataset: /home/tumu/Self/Research/Work/tensorflow_work/models/try/rnnt-speech-recognition/data/clips/dev.csv                                                                                    
I Saved new best validating model with loss 39.136365 to: /data/Self/test/DeepSpeech/train_3/best_dev-761266
--------------------------------------------------------------------------------
Epoch 4 |   Training | Elapsed Time: 0:48:02 | Steps: 7186 | Loss: 28.845716                                                                                                                                                                                                  
Epoch 4 | Validation | Elapsed Time: 0:01:25 | Steps: 475 | Loss: 38.489472 | Dataset: /home/tumu/Self/Research/Work/tensorflow_work/models/try/rnnt-speech-recognition/data/clips/dev.csv                                                                                    
I Saved new best validating model with loss 38.489472 to: /data/Self/test/DeepSpeech/train_3/best_dev-768452
--------------------------------------------------------------------------------
Epoch 5 |   Training | Elapsed Time: 0:48:02 | Steps: 7186 | Loss: 28.051135                                                                                                                                                                                                  
Epoch 5 | Validation | Elapsed Time: 0:01:25 | Steps: 475 | Loss: 37.851685 | Dataset: /home/tumu/Self/Research/Work/tensorflow_work/models/try/rnnt-speech-recognition/data/clips/dev.csv                                                                                    
I Saved new best validating model with loss 37.851685 to: /data/Self/test/DeepSpeech/train_3/best_dev-775638
--------------------------------------------------------------------------------
Epoch 6 |   Training | Elapsed Time: 0:48:02 | Steps: 7186 | Loss: 27.403971                                                                                                                                                                                                  
Epoch 6 | Validation | Elapsed Time: 0:01:25 | Steps: 475 | Loss: 37.467827 | Dataset: /home/tumu/Self/Research/Work/tensorflow_work/models/try/rnnt-speech-recognition/data/clips/dev.csv                                                                                    
I Saved new best validating model with loss 37.467827 to: /data/Self/test/DeepSpeech/train_3/best_dev-782824
--------------------------------------------------------------------------------
Epoch 7 |   Training | Elapsed Time: 0:48:02 | Steps: 7186 | Loss: 26.854938                                                                                                                                                                                                  
Epoch 7 | Validation | Elapsed Time: 0:01:25 | Steps: 475 | Loss: 37.366411 | Dataset: /home/tumu/Self/Research/Work/tensorflow_work/models/try/rnnt-speech-recognition/data/clips/dev.csv                                                                                    
I Saved new best validating model with loss 37.366411 to: /data/Self/test/DeepSpeech/train_3/best_dev-790010
--------------------------------------------------------------------------------
Epoch 8 |   Training | Elapsed Time: 0:48:02 | Steps: 7186 | Loss: 26.361723                                                                                                                                                                                                  
Epoch 8 | Validation | Elapsed Time: 0:01:25 | Steps: 475 | Loss: 37.046090 | Dataset: /home/tumu/Self/Research/Work/tensorflow_work/models/try/rnnt-speech-recognition/data/clips/dev.csv                                                                                    
I Saved new best validating model with loss 37.046090 to: /data/Self/test/DeepSpeech/train_3/best_dev-797196
--------------------------------------------------------------------------------
Epoch 9 |   Training | Elapsed Time: 0:48:02 | Steps: 7186 | Loss: 25.926279                                                                                                                                                                                                  
Epoch 9 | Validation | Elapsed Time: 0:01:25 | Steps: 475 | Loss: 36.745385 | Dataset: /home/tumu/Self/Research/Work/tensorflow_work/models/try/rnnt-speech-recognition/data/clips/dev.csv                                                                                    
I Saved new best validating model with loss 36.745385 to: /data/Self/test/DeepSpeech/train_3/best_dev-804382
--------------------------------------------------------------------------------
Epoch 10 |   Training | Elapsed Time: 0:48:08 | Steps: 7186 | Loss: 25.514988                                                                                                                                                                                                 
Epoch 10 | Validation | Elapsed Time: 0:01:25 | Steps: 475 | Loss: 36.442261 | Dataset: /home/tumu/Self/Research/Work/tensorflow_work/models/try/rnnt-speech-recognition/data/clips/dev.csv                                                                                   
I Saved new best validating model with loss 36.442261 to: /data/Self/test/DeepSpeech/train_3/best_dev-811568
--------------------------------------------------------------------------------
Epoch 11 |   Training | Elapsed Time: 0:48:07 | Steps: 7186 | Loss: 25.151161                                                                                                                                                                                                 
Epoch 11 | Validation | Elapsed Time: 0:01:24 | Steps: 475 | Loss: 36.470721 | Dataset: /home/tumu/Self/Research/Work/tensorflow_work/models/try/rnnt-speech-recognition/data/clips/dev.csv                                                                                   
--------------------------------------------------------------------------------
Epoch 12 |   Training | Elapsed Time: 0:48:01 | Steps: 7186 | Loss: 24.817111                                                                                                                                                                                                 
Epoch 12 | Validation | Elapsed Time: 0:01:24 | Steps: 475 | Loss: 36.185578 | Dataset: /home/tumu/Self/Research/Work/tensorflow_work/models/try/rnnt-speech-recognition/data/clips/dev.csv                                                                                   
I Saved new best validating model with loss 36.185578 to: /data/Self/test/DeepSpeech/train_3/best_dev-825940

I’m not getting good results with my new checkpoints (I did the same exercise for 3 epochs and verified some results against Common Voice audio files from train.tsv), so it seems I’m doing something wrong.

  1. When I add new characters to the alphabet, do I need to retrain the scorer as well? My new alphabet is mostly a-z, 0-9 and a few characters such as the period (.), double quote ("), comma (,) and semicolon (;).
  2. Do I need to change my training parameters? Maybe I’m doing something wrong there.
  3. I followed the procedure mentioned here, and prepared the data using the command
    bin/import_cv2.py --filter_alphabet data/new_alphabet.txt <path_to_common_voice_tsv_files>
  4. Do I need to perform any additional steps, or drop more layers, to train the model with the new alphabet (new vocabulary)? Also, I would like to start from the published DeepSpeech checkpoint rather than from scratch.

Please let me know how to proceed with training a model with a new alphabet. Thank you.

Thanks for the well-written question, we got most of the info we need :slight_smile:

DeepSpeech tries to match letters to sounds; I personally have not heard of a training setup that uses 0-9 as digit labels. Usually you do something like num2words to transform them beforehand.
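
For illustration, a rough sketch of that transformation with the num2words package (pip install num2words; the function name and regex are just illustrative, not from any DeepSpeech tooling):

    import re
    from num2words import num2words

    def normalize_numbers(transcript):
        # Replace each run of digits with its spoken form, e.g. "97" -> "ninety seven".
        def spell_out(match):
            words = num2words(int(match.group(0)))
            # num2words emits hyphens and commas ("ninety-seven"); replace them with
            # spaces so the result only uses characters from the a-z alphabet.
            return re.sub(r"[-,]", " ", words)
        return re.sub(r"\d+", spell_out, transcript)

    print(normalize_numbers("Person is a 97 year old female"))
    # -> Person is a ninety seven year old female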

But you should try to run some inferences without a scorer to see what your trained model gives as output. If that is OK, you could build your custom scorer, which you will need anyway because the standard one only covers the lowercase letters a-z and the apostrophe (’).
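
From your checkpoint, something like this should do it (a sketch based on the 0.7 training flags; the wav path is a placeholder, so double-check the flag names against your checkout):

    python3 DeepSpeech.py \
        --n_hidden 2048 \
        --alphabet_config_path data/new_alphabet.txt \
        --load_checkpoint_dir /data/Self/test/DeepSpeech/train_3/ \
        --scorer_path "" \
        --one_shot_infer data/clips/some_test_clip.wav

Passing an empty --scorer_path disables the external scorer, so you see the raw output of the acoustic model.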

Hi @othiele,

Thank you very much. When the speech contains “Person is a ninety seven year old female”, I want the final text output to be “Person is a 97 year old female”. Similarly, if the audio contains “seventeenth of this month”, I need the final output to be “17th of this month”.

So based on what you are saying, the DeepSpeech model should be trained with letters only (numbers in the training data converted into words), inference with the scorer will give output like “Person is a ninety seven year old female”, and then we have to do some post-processing to recover the numbers.
Sure, I’ll try this approach.
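
For the post-processing step, something like the word2number package looks usable (a quick sketch; in practice the spoken-number phrases would need to be located in the text first):

    from word2number import w2n  # pip install word2number

    # Convert a spoken-number phrase back to digits.
    print(w2n.word_to_num("ninety seven"))  # 97
    print(w2n.word_to_num("seventeen"))     # 17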

  1. Also, why is my training loss going up while an epoch is in progress? At the beginning of each epoch the training loss is low, but it rises as the epoch progresses. The same happened for all of my 20 epochs. Is something wrong with my training/hyperparameter values?

  2. When I want to train with my own data, what should be the maximum length/duration (in seconds or milliseconds) of each audio file, and the minimum length (if any)? I have a pretty long audio file. Is there a tool you recommend, in the DeepSpeech repo or elsewhere, to split long audio files into multiple small files such that words do not get cut in the middle of a sentence?

Yes, that’s the usual way to do it.

Both training and dev losses are decreasing in your data from epoch to epoch. This is absolutely normal for OK data. (The running loss shown during an epoch can climb simply because DeepSpeech feeds samples sorted by length, so the short, easier clips at the start of an epoch score lower than the long ones at the end.)

I usually take between 5 and 10 seconds of input. How you get that data is a whole science in itself. Try DSAlign from the Mozilla guys for forced alignment. But depending on the data, this can be a lot of work.
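
If DSAlign is overkill for your data, a rough first pass is to cut at pauses, e.g. with pydub (a sketch, not a recipe: the input filename is made up and the silence thresholds need tuning per recording):

    from pydub import AudioSegment
    from pydub.silence import split_on_silence  # pip install pydub (needs ffmpeg)

    audio = AudioSegment.from_wav("long_recording.wav")
    # DeepSpeech expects 16 kHz mono input.
    audio = audio.set_frame_rate(16000).set_channels(1)

    # Cut at pauses so words are unlikely to be clipped mid-word.
    chunks = split_on_silence(
        audio,
        min_silence_len=500,             # at least 500 ms of quiet counts as a boundary
        silence_thresh=audio.dBFS - 16,  # "quiet" = 16 dB below the average level
        keep_silence=200,                # keep 200 ms of padding on each side
    )

    for i, chunk in enumerate(chunks):
        chunk.export(f"clip_{i:04d}.wav", format="wav")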

@othiele, I had a follow-up question on this note:

I usually take between 5 and 10 seconds of input. How you get that data is a whole science in itself. Try DSAlign from the Mozilla guys for forced alignment. But depending on the data, this can be a lot of work.

Are these 5-10 second clips usually from a single speaker, or do you mix speakers? Would it be OK to mix speakers? Also, have you seen lower WER with longer sequences vs. shorter ones?

thanks

Mixed speakers are OK for me. It’s more important to have good-quality input; size is less important.