Fine-Tuning with limited data - Questions on Fine-Tuning in General

Edit:
Following “What and how to report”:

  • Mozilla STT version: DeepSpeech 0.8.2
  • OS: Linux Mint 19.1 Tessa
  • Python 3.6.9
  • Tensorflow 1.15.2
  • not using GPU

Hello, I’ve been looking into DeepSpeech for a while.

I’ve installed DeepSpeech with pip:
pip3 install deepspeech
and downloaded both the pre-trained model and the scorer from the latest release (v0.8.2).

I ran inference with
deepspeech --model /my/path/to/deepspeech-0.8.2-models.pbmm --scorer /my/path/to/deepspeech-0.8.2-models.scorer --audio /my/path/to/myaudio.wav
and got the following results:

original  : "in the first three forms we copy the tree to the entries"
inference : "in the first three fortunes we copied the three to the empress"

original  : "if you already downloaded and used the tools for something else"
inference : "if you are a detonation he the tools for something else"

original  : "you are happily working on something and find the changes in these files are in good order"
inference : "you are happily working on something and find the changes in this files are in good order"

original  : "this is most often done when you remembered what you just committed is incomplete"
inference : "this is most often done when you remembered what you just committed to is incomplete"

In the above files I am doing my best to mimic a US accent.
(Before that, I had tested DeepSpeech with my normal Greek accent and got worse results.)
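
For completeness, the same inference can also be run from Python with the deepspeech package; a minimal sketch, assuming a 16 kHz mono 16-bit WAV like the ones above (paths are placeholders):

    import wave
    import numpy as np
    from deepspeech import Model

    # Load the released acoustic model and the external scorer.
    ds = Model("/my/path/to/deepspeech-0.8.2-models.pbmm")
    ds.enableExternalScorer("/my/path/to/deepspeech-0.8.2-models.scorer")

    # Read the WAV into an int16 buffer; the model expects 16 kHz mono 16-bit audio.
    w = wave.open("/my/path/to/myaudio.wav", "rb")
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    w.close()

    print(ds.stt(audio))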

I wasn’t exactly satisfied with my results on accented speech, so I looked around for a solution.
I have read the documentation and focused on the “Fine-tuning” section.
I am wondering whether it’s possible to fine-tune a model with limited data from a single speaker.
And by that I mean:

  • gather transcribed audio from a single person, preferably a small amount (around 2-5 minutes of speech)
  • fine-tune a model (from the final checkpoint) in a “small” amount of time (hopefully less than 30 minutes)

Since I don’t have experience in speech recognition or in training neural networks,
I will try to document my every step so the experts can point out my mistakes :slight_smile:

I’ve followed the documentation steps:

  • git clone https://github.com/mozilla/DeepSpeech
  • python3 -m venv $HOME/tmp/deepspeech-train-venv/
  • source $HOME/tmp/deepspeech-train-venv/bin/activate
  •     cd DeepSpeech
        pip3 install --upgrade pip==20.0.2 wheel==0.34.2 setuptools==46.1.3
        pip3 install --upgrade -e .
    
  • I have python3-dev
  • don’t have CUDA (so I used --load_cudnn later), so my training times are expectedly slow.
  • skipped the Dockerfile setup (make Dockerfile.train)
  • downloaded the checkpoint, the pre-trained model, and the scorer from the latest release (v0.8.2)
  • prepared data:
    I’ve seen some posts about splitting your corpus into a 7:2:1 or 8:1:1 ratio for train : dev : test respectively,
    and some posts (as well as the release documentation) that used the same set for both validation (dev) and testing (test).
    So I did the same (the CSV and audio files are listed below),
    with two directories /train and /dev; the *.wav file paths in the CSVs are relative.
    I used wc -c to get the audio file sizes in bytes and put them in the CSV files (a scripted alternative is sketched right after this list).
    I wondered whether I should set the --read_buffer hyperparameter, but it seemed to work fine as is.
  •     python3 DeepSpeech.py --load_cudnn --n_hidden 2048 --checkpoint_dir /my/path/to/deepspeech-0.8.2-checkpoint/ --epochs 3 --train_files /my/path/to/train/train.csv --dev_files /my/path/to/dev/dev.csv --test_files /my/path/to/dev/dev.csv --learning_rate 0.0001 --export_dir /my/path/to/deepspeech-0.8.2-myname --scorer /my/path/to/deepspeech-0.8.2-models.scorer
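
Here is a minimal sketch of how such a CSV could be generated from Python instead of with wc -c (the build_csv helper and the one-entry transcript dict are just my own illustration, not part of DeepSpeech):

    import csv
    import os

    def build_csv(wav_dir, transcripts, csv_path):
        """Write a DeepSpeech-style manifest: wav_filename,wav_filesize,transcript."""
        with open(csv_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["wav_filename", "wav_filesize", "transcript"])
            for wav_name, text in transcripts.items():
                # os.path.getsize returns the size in bytes, the same value `wc -c` reports.
                size = os.path.getsize(os.path.join(wav_dir, wav_name))
                writer.writerow([wav_name, size, text])

    build_csv(
        "train",
        {"utterance_0.wav": "author of the danger trail philip steels etc"},
        os.path.join("train", "train.csv"),
    )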
    
    

I included a file with the output of the fine-tuning run (the command above);
here are the results (paths are hidden):

        Epoch 0 |   Training | Elapsed Time: 0:10:18 | Steps: 20 | Loss: 63.334317                                                                                                                                                                   
        Epoch 0 | Validation | Elapsed Time: 0:00:11 | Steps: 5 | Loss: 63.075968 | Dataset:/my/path/to/dev/dev.csv                                                                                   
        I Saved new best validating model with loss 63.075968 to: /my/path/to/deepspeech-0.8.2-checkpoint/best_dev-732562
        --------------------------------------------------------------------------------
        Epoch 1 |   Training | Elapsed Time: 0:09:55 | Steps: 20 | Loss: 72.434603                                                                                                                                                                   
        Epoch 1 | Validation | Elapsed Time: 0:00:11 | Steps: 5 | Loss: 89.295271 | Dataset: /my/path/to/dev/dev.csv                                                                                   
        --------------------------------------------------------------------------------
        Epoch 2 |   Training | Elapsed Time: 0:10:05 | Steps: 20 | Loss: 42.426105                                                                                                                                                                   
        Epoch 2 | Validation | Elapsed Time: 0:00:12 | Steps: 5 | Loss: 69.895619 | Dataset: /my/path/to/dev/dev.csv                                                                                   
        --------------------------------------------------------------------------------
        I FINISHED optimization in 0:31:04.132895
        I Loading best validating checkpoint from /my/path/to/deepspeech-0.8.2-checkpoint/best_dev-732562
        I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/bias
        I Loading variable from checkpoint: cudnn_lstm/rnn/multi_rnn_cell/cell_0/cudnn_compatible_lstm_cell/kernel
        I Loading variable from checkpoint: global_step
        I Loading variable from checkpoint: layer_1/bias
        I Loading variable from checkpoint: layer_1/weights
        I Loading variable from checkpoint: layer_2/bias
        I Loading variable from checkpoint: layer_2/weights
        I Loading variable from checkpoint: layer_3/bias
        I Loading variable from checkpoint: layer_3/weights
        I Loading variable from checkpoint: layer_5/bias
        I Loading variable from checkpoint: layer_5/weights
        I Loading variable from checkpoint: layer_6/bias
        I Loading variable from checkpoint: layer_6/weights
        Testing model on /my/path/to/dev/dev.csv
        Test epoch | Steps: 5 | Elapsed Time: 0:00:12                                                                                                                                                                                                
        Test on /my/path/to/dev/dev.csv - WER: 0.611111, CER: 0.375887, loss: 63.075970
        --------------------------------------------------------------------------------
        Best WER:
        --------------------------------------------------------------------------------
        WER: 0.400000, CER: 0.291667, loss: 24.661560
         - wav: file:///my/path/to//dev/utterance_0.wav
         - src: "training a model is easy"
         - res: "train a model is these"
        --------------------------------------------------------------------------------
        WER: 0.500000, CER: 0.451613, loss: 31.155817
         - wav: file:///my/path/to/dev/utterance_2.wav
         - src: "be quiet and only report errors"
         - res: "be quiet and alligator"
        --------------------------------------------------------------------------------
        WER: 0.583333, CER: 0.392857, loss: 56.349792
         - wav: file:///my/path/to/dev/utterance_1.wav
         - src: "in the first three forms we copy the tree to the entries"
         - res: "in the first three fours we go petitoire"
        --------------------------------------------------------------------------------
        WER: 0.647059, CER: 0.400000, loss: 121.987396
         - wav: file:///my/path/to/dev/utterance_3.wav
         - src: "you are happily working on something and find the changes in these files are in good order"
         - res: "you are a believer in enforcing and find the chances in this is a bamboo"
        --------------------------------------------------------------------------------
        WER: 0.714286, CER: 0.333333, loss: 81.225273
         - wav: file:///my/path/to/dev/utterance_4.wav
         - src: "this is most often done when you remembered what you just committed is incomplete"
         - res: "this is not an den i omeme what you just come he is in complete"
        --------------------------------------------------------------------------------
        Median WER:
        --------------------------------------------------------------------------------
        WER: 0.400000, CER: 0.291667, loss: 24.661560
         - wav: file:///my/path/to/dev/utterance_0.wav
         - src: "training a model is easy"
         - res: "train a model is these"
        --------------------------------------------------------------------------------
        WER: 0.500000, CER: 0.451613, loss: 31.155817
         - wav: file:///my/path/to/dev/utterance_2.wav
         - src: "be quiet and only report errors"
         - res: "be quiet and alligator"
        --------------------------------------------------------------------------------
        WER: 0.583333, CER: 0.392857, loss: 56.349792
         - wav: file:///my/path/to/dev/utterance_1.wav
         - src: "in the first three forms we copy the tree to the entries"
         - res: "in the first three fours we go petitoire"
        --------------------------------------------------------------------------------
        WER: 0.647059, CER: 0.400000, loss: 121.987396
         - wav: file:///my/path/to/dev/utterance_3.wav
         - src: "you are happily working on something and find the changes in these files are in good order"
         - res: "you are a believer in enforcing and find the chances in this is a bamboo"
        --------------------------------------------------------------------------------
        WER: 0.714286, CER: 0.333333, loss: 81.225273
         - wav: file:///my/path/to/dev/utterance_4.wav
         - src: "this is most often done when you remembered what you just committed is incomplete"
         - res: "this is not an den i omeme what you just come he is in complete"
        --------------------------------------------------------------------------------
        Worst WER:
        --------------------------------------------------------------------------------
        WER: 0.400000, CER: 0.291667, loss: 24.661560
         - wav: file:///my/path/to/dev/utterance_0.wav
         - src: "training a model is easy"
         - res: "train a model is these"
        --------------------------------------------------------------------------------
        WER: 0.500000, CER: 0.451613, loss: 31.155817
         - wav: file:///my/path/to/dev/utterance_2.wav
         - src: "be quiet and only report errors"
         - res: "be quiet and alligator"
        --------------------------------------------------------------------------------
        WER: 0.583333, CER: 0.392857, loss: 56.349792
         - wav: file:///my/path/to/dev/utterance_1.wav
         - src: "in the first three forms we copy the tree to the entries"
         - res: "in the first three fours we go petitoire"
        --------------------------------------------------------------------------------
        WER: 0.647059, CER: 0.400000, loss: 121.987396
         - wav: file:///my/path/to/dev/utterance_3.wav
         - src: "you are happily working on something and find the changes in these files are in good order"
         - res: "you are a believer in enforcing and find the chances in this is a bamboo"
        --------------------------------------------------------------------------------
        WER: 0.714286, CER: 0.333333, loss: 81.225273
         - wav: file:///my/path/to/dev/utterance_4.wav
         - src: "this is most often done when you remembered what you just committed is incomplete"
         - res: "this is not an den i omeme what you just come he is in complete"
        --------------------------------------------------------------------------------
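
For anyone wondering about the metrics: WER is the word-level edit distance between src and res divided by the number of words in src, and CER is the same thing computed over characters. A minimal sketch of that computation (my own illustration, not DeepSpeech’s exact code):

    def edit_distance(ref, hyp):
        """Levenshtein distance between two token sequences."""
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(ref)][len(hyp)]

    def wer(src, res):
        return edit_distance(src.split(), res.split()) / len(src.split())

    # Matches the first result above ("training a model is easy"):
    print(wer("training a model is easy", "train a model is these"))  # 0.4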

I searched around for a similar question or for documentation on training with small amounts of data, but didn’t find any.
So here are my questions:

  • did I miss a step somewhere?
  • is it possible to fine-tune a model with a small amount of data and time?
    if so, is there a lower threshold for the training set size, or documentation/a paper on this that I can read?
  • is there any correlation between the amount of training data and the number of epochs? What value for --epochs would be suggested?
  • is my accent too heavy and/or not represented in the training set? (I have a Greek accent)
  • can the checkpoint (v0.8.2) be considered equivalent to the pre-trained model (v0.8.2)?
    After testing them both I got the same results, but I wanted to make sure:
    • For the inference (pre-trained model) I used:
      deepspeech --model /my/path/to/deepspeech-0.8.2-models.pbmm --scorer /my/path/to/deepspeech-0.8.2-models.scorer --audio /my/path/to/dev/utterance_0.wav
      (run on all the audio files _0 to _4) I got:
        original  : "training a model is easy"
        inference : "training a model is easy"
    
        original  : "in the first three forms we copy the tree to the entries"
        inference : "in the first three fortunes we copied the three to the empress"
    
        original  : "be quiet and only report errors"
        inference : "be quiet and only report terrors"
    
        original  : "you are happily working on something and find the changes in these files are in good order"
        inference : "you are happily working on something and find the changes in this files are in good order"
    
        original  : "this is most often done when you remembered what you just committed is incomplete"
        inference : "this is most often done when you remembered what you just committed to is incomplete"
    
    
    • For the WER (checkpoint model) I used:
      DeepSpeech.py --load_cudnn --checkpoint_dir /my/path/to/deepspeech-0.8.2-checkpoint/ --test_files /my/path/to/dev/dev.csv --scorer /my/path/to/deepspeech-0.8.2-models.scorer
      (the same audio files are in the /dev/dev.csv)
        Testing model on /my/path/to/dev/dev.csv
        Test epoch | Steps: 5 | Elapsed Time: 0:00:12
        Test on /my/path/to/dev/dev.csv - WER: 0.611111, CER: 0.375887, loss: 63.075970
        --------------------------------------------------------------------------------
        Best WER: 
        --------------------------------------------------------------------------------
        WER: 0.000000, CER: 0.000000, loss: 1.462615
         - wav: file:///my/path/to/dev/utterance_0.wav
         - src: "training a model is easy"
         - res: "training a model is easy"
        --------------------------------------------------------------------------------
        WER: 0.058824, CER: 0.022222, loss: 17.974970
         - wav: file:///my/path/to/dev/utterance_3.wav
         - src: "you are happily working on something and find the changes in these files are in good order"
         - res: "you are happily working on something and find the changes in this files are in good order"
        --------------------------------------------------------------------------------
        WER: 0.071429, CER: 0.037037, loss: 14.509259
         - wav: file:///my/path/to/dev/utterance_4.wav
         - src: "this is most often done when you remembered what you just committed is incomplete"
         - res: "this is most often done when you remembered what you just committed to is incomplete"
        --------------------------------------------------------------------------------
        WER: 0.166667, CER: 0.032258, loss: 4.264238
         - wav: file:///my/path/to/dev/utterance_2.wav
         - src: "be quiet and only report errors"
         - res: "be quiet and only report terrors"
        --------------------------------------------------------------------------------
        WER: 0.333333, CER: 0.214286, loss: 40.977341
         - wav: file:///my/path/to/dev/utterance_1.wav
         - src: "in the first three forms we copy the tree to the entries"
         - res: "in the first three fortunes we copied the three to the empress"
        --------------------------------------------------------------------------------
        Median WER: 
        --------------------------------------------------------------------------------
        WER: 0.000000, CER: 0.000000, loss: 1.462615
         - wav: file:///my/path/to/dev/utterance_0.wav
         - src: "training a model is easy"
         - res: "training a model is easy"
        --------------------------------------------------------------------------------
        WER: 0.058824, CER: 0.022222, loss: 17.974970
         - wav: file:///my/path/to/dev/utterance_3.wav
         - src: "you are happily working on something and find the changes in these files are in good order"
         - res: "you are happily working on something and find the changes in this files are in good order"
        --------------------------------------------------------------------------------
        WER: 0.071429, CER: 0.037037, loss: 14.509259
         - wav: file:///my/path/to/dev/utterance_4.wav
         - src: "this is most often done when you remembered what you just committed is incomplete"
         - res: "this is most often done when you remembered what you just committed to is incomplete"
        --------------------------------------------------------------------------------
        WER: 0.166667, CER: 0.032258, loss: 4.264238
         - wav: file:///my/path/to/dev/utterance_2.wav
         - src: "be quiet and only report errors"
         - res: "be quiet and only report terrors"
        --------------------------------------------------------------------------------
        WER: 0.333333, CER: 0.214286, loss: 40.977341
         - wav: file:///my/path/to/dev/utterance_1.wav
         - src: "in the first three forms we copy the tree to the entries"
         - res: "in the first three fortunes we copied the three to the empress"
        --------------------------------------------------------------------------------
        Worst WER: 
        --------------------------------------------------------------------------------
        WER: 0.000000, CER: 0.000000, loss: 1.462615
         - wav: file:///my/path/to/dev/utterance_0.wav
         - src: "training a model is easy"
         - res: "training a model is easy"
        --------------------------------------------------------------------------------
        WER: 0.058824, CER: 0.022222, loss: 17.974970
         - wav: file:///my/path/to/dev/utterance_3.wav
         - src: "you are happily working on something and find the changes in these files are in good order"
         - res: "you are happily working on something and find the changes in this files are in good order"
        --------------------------------------------------------------------------------
        WER: 0.071429, CER: 0.037037, loss: 14.509259
         - wav: file:///my/path/to/dev/utterance_4.wav
         - src: "this is most often done when you remembered what you just committed is incomplete"
         - res: "this is most often done when you remembered what you just committed to is incomplete"
        --------------------------------------------------------------------------------
        WER: 0.166667, CER: 0.032258, loss: 4.264238
         - wav: file:///my/path/to/dev/utterance_2.wav
         - src: "be quiet and only report errors"
         - res: "be quiet and only report terrors"
        --------------------------------------------------------------------------------
        WER: 0.333333, CER: 0.214286, loss: 40.977341
         - wav: file:///my/path/to/dev/utterance_1.wav
         - src: "in the first three forms we copy the tree to the entries"
         - res: "in the first three fortunes we copied the three to the empress"
        --------------------------------------------------------------------------------
    
    
  • does the fine-tuning process modify the checkpoint model?
  • in this post there is a suggestion to use some of the Common Voice data set when fine-tuning,
    however it is unclear to me whether this suggestion is retracted at the end. So, rephrasing the post:
    is it a good idea to mix some data from the Common Voice data set (or any other data set used in the original training) with the speaker’s data when fine-tuning?
    (If it is a good idea, I’m guessing I should look for data in an accent similar to the speaker’s.)

Thank you very much!

Edit: it seems I’m a new user on Discourse, so I cannot upload any files yet…
I’ll just describe them and give you a sample (no audio, sadly).
In the train directory I have:

  • train.csv
  • 20 wav files (utterance_0.wav - utterance_19.wav)

Here is a preview of train.csv:

wav_filename,wav_filesize,transcript
utterance_0.wav,159788,author of the danger trail philip steels etc
utterance_1.wav,196652,not at this particular case tom apologized whittemore
utterance_2.wav,151596,for the twentieth time that evening the two men shook hands
utterance_3.wav,151596,lord but i'm glad to see you again phil
utterance_4.wav,86060,will we ever forget it
utterance_5.wav,159788,god bless 'em i hope i'll go on seeing them forever
utterance_6.wav,155692,and you always want to see it in the superlative degree
utterance_7.wav,114732,gad your letter came just in time

In the dev directory I have:

  • dev.csv
  • 5 wav files (utterance_0.wav - utterance_4.wav)

Here is the whole dev.csv:

wav_filename,wav_filesize,transcript
utterance_0.wav,98348,training a model is easy
utterance_1.wav,163884,in the first three forms we copy the tree to the entries
utterance_2.wav,131116,be quiet and only report errors
utterance_3.wav,225324,you are happily working on something and find the changes in these files are in good order
utterance_4.wav,225324,this is most often done when you remembered what you just committed is incomplete

(In case you are wondering, the training text is from the CMU Sphinx tutorial, i.e. the CMU Arctic data set.)
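
As far as I understand, the released English model expects 16 kHz, mono, 16-bit WAV files, so here is a small sanity check I can run over the CSVs above (the check_csv helper is just my own illustration; relative paths are resolved against the CSV’s directory, as in my setup):

    import csv
    import os
    import wave

    def check_csv(csv_path):
        """Verify that sizes match and every WAV is 16 kHz, mono, 16-bit."""
        base = os.path.dirname(csv_path)
        with open(csv_path, newline="") as f:
            for row in csv.DictReader(f):
                wav_path = os.path.join(base, row["wav_filename"])
                assert os.path.getsize(wav_path) == int(row["wav_filesize"]), wav_path
                w = wave.open(wav_path, "rb")
                assert w.getframerate() == 16000, wav_path  # sample rate
                assert w.getnchannels() == 1, wav_path      # mono
                assert w.getsampwidth() == 2, wav_path      # 16-bit samples
                w.close()

    check_csv("/my/path/to/train/train.csv")
    check_csv("/my/path/to/dev/dev.csv")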

Thank you for reading our guidelines and posting these details; that makes it easier to help.

No, what you did looks good.

Usually you would use tens of hours, so maybe a couple thousand files. I don’t know whether so few files can have an impact.

@lissyx might know whether there is something out there, but you would have to program that yourself, it is not built in. Prepare for C++ :slight_smile:

Yes and no. Ideally you would train a couple thousand files for 10-20 epochs and check whether the model still improves or starts to overfit. 3 epochs are way too few.

DeepSpeech is known to have problems with accents. So no, the model is bad at accents.

Yes, that is right. Sometimes loading the data before it is fed to the model can lead to small differences, as can the beam search in the language model afterwards.

No, it will save the changes in the next checkpoint.

Only if the Common Voice data has the feature you want to train. If it is a Greek accent, you would need such data.

Finally, you didn’t ask about the learning rate, but something like 1e-3 seems too harsh. Use 1e-4 or 1e-5, as fine-tuning changes the existing net. If you use a high learning rate you destroy a lot of information already in the net, and because you have just a few files you replace it with limited information.

All the best and ask more if there are still questions. And even I can’t post files here :slight_smile:

OK, thank you for your clear answer!
I found this issue on “Explaining model limitations” and re-read the documentation for v0.8.2:

Note that the model currently performs best in low-noise environments with clear recordings and has a bias towards US male accents. This does not mean the model cannot be used outside of these conditions, but that accuracy may be lower. Some users may need to train the model further to meet their intended use-case.

This note, together with your explanations, makes it pretty clear why the model has poor accuracy on accented speech.

Can we expect a pre-trained English model with wider accent coverage in release v1.0?

Unfortunately not really, because the project will be decoupled from Mozilla at some point, so progress is not what it used to be :frowning: The remaining staff do what they can, though.

The best you can do is to ask other Greeks to donate their voice; with that data you/we can train more models: