Cannot train pre-trained model even after using negative epochs

I am trying to train a pretrained model on AWS. I downloaded the checkpoint from this link: https://github.com/mozilla/DeepSpeech/releases

> python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir deepspeech-0.4.1-checkpoint --epochs -3 --train_files clips/train.csv --dev_files clips/dev.csv --test_files clips/test.csv --learning_rate 0.00001 --export_dir model/

> Preprocessing ['clips/train.csv']
> Preprocessing done
> Preprocessing ['clips/dev.csv']
> Preprocessing done
> W Parameter --validation_step needs to be >0 for early stopping to work
> Preprocessing ['clips/test.csv']
> Preprocessing done
> Computing acoustic model predictions...
> 100% (25 of 25) |########################| Elapsed Time: 0:00:54 Time:  0:00:54
> Decoding predictions...
> 100% (25 of 25) |########################| Elapsed Time: 0:00:03 Time:  0:00:03
> Test - WER: 0.818182, CER: 5.840000, loss: 21.595797
> --------------------------------------------------------------------------------
> WER: 2.000000, CER: 4.000000, loss: 18.038776
>  - src: "knowlegde"
>  - res: "no ledge"
> --------------------------------------------------------------------------------
> WER: 2.000000, CER: 4.000000, loss: 20.510443
>  - src: "knowlegde"
>  - res: "no ledge"
> --------------------------------------------------------------------------------
> WER: 2.000000, CER: 6.000000, loss: 23.434345
>  - src: "article"
>  - res: "i do"
> --------------------------------------------------------------------------------
> WER: 2.000000, CER: 7.000000, loss: 39.423889
>  - src: "article"
>  - res: "i began"
> --------------------------------------------------------------------------------
> WER: 1.500000, CER: 8.000000, loss: 21.053690
>  - src: "articles run"
>  - res: "i tell en"
> --------------------------------------------------------------------------------
> WER: 1.500000, CER: 10.000000, loss: 40.492012
>  - src: "articles run"
>  - res: "i do concern"
> --------------------------------------------------------------------------------
> WER: 1.400000, CER: 20.000000, loss: 68.133987
>  - src: "aero are copyright files illegal"
>  - res: "ada or copperas piles in the good"
> --------------------------------------------------------------------------------
> WER: 1.000000, CER: 3.000000, loss: 2.642921
>  - src: "base"
>  - res: "these"
> --------------------------------------------------------------------------------
> WER: 1.000000, CER: 2.000000, loss: 2.887465
>  - src: "tab"
>  - res: "a"
> --------------------------------------------------------------------------------
> WER: 1.000000, CER: 2.000000, loss: 5.575221
>  - src: "hi"
>  - res: "it"
> --------------------------------------------------------------------------------
> I Exporting the model...
> I Models exported at model/

I don't know what is wrong. I had already trained on my laptop, then started from scratch on AWS, but it's not working.
I found a similar question, but there the problem was that the poster did not provide a negative value for epochs; I have provided -3.
What could be causing this error? Any ideas?
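
A side note on reading the report above: per-sample WER values larger than 1.0 are possible when the result contains more words than the source. The sketch below assumes WER is the word-level edit distance divided by the number of words in the source; that assumption reproduces the WER figures printed for the pairs shown, but it is an illustration, not the project's own code. The function names are made up for this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(src, res):
    """Assumed definition: word-level edits divided by source word count."""
    ref_words = src.split()
    return edit_distance(ref_words, res.split()) / len(ref_words)


# Pairs copied from the test report above.
pairs = [
    ("knowlegde", "no ledge"),
    ("articles run", "i tell en"),
    ("aero are copyright files illegal", "ada or copperas piles in the good"),
]
for src, res in pairs:
    print(f"{wer(src, res):.6f}  {src!r} -> {res!r}")
```

For "knowlegde" against "no ledge", one substitution plus one insertion over a single source word gives 2.0, which matches the log.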

Can you elaborate ? What you show here is that it did export a model as expected. I’m not sure what you mean when you say “its not working”.

@lissyx But it should have started training.

I expected output along the lines of “I started optimization” and “I am training on epoch something”.

It's not starting any epochs.

We changed the behavior, are you running master or 0.4.1 ?

@lissyx I am using 0.4.1.
Thanks for the fast reply.

You’re mixing things up between master and v0.4.1. For example, in v0.4.1, there’s no --epochs flag.

Hello @reuben @lissyx
I am facing the same problem as @Sushantmkarande.
I have downloaded DeepSpeech 0.6.1 and deepspeech-0.6.1-checkpoints from the DeepSpeech release page. I need to fine-tune DeepSpeech by adding my own audio files on top of the pretrained model (continuing training from a release model).
When I give a negative epochs value, it goes straight to testing, but it should have trained first and then tested.

python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir deepspeech-0.6.1-checkpoint --epochs -10 --train_files …/deepspeech-0.6.1-models/audio/recordings_csv/train.csv --dev_files …/deepspeech-0.6.1-models/audio/recordings_csv/validate.csv --test_files …/deepspeech-0.6.1-models/audio/recordings_csv/test.csv --learning_rate 0.0001 --use_cudnn_rnn true --use_allow_growth true --lm_binary_path …/deepspeech-0.6.1-models/lm.binary --lm_trie_path …/deepspeech-0.6.1-models/trie

I can’t install git lfs, so I am using lm.binary and trie from deepspeech-0.6.1-models (pretrained).
When I train with a positive value (5 epochs), training succeeds, but the model works only on the training data, not on any other data (high WER).

This is not supported

Please give more context, nothing conclusive here.

Thanks for quick response @lissyx
I am training on https://www.kaggle.com/rtatman/speech-accent-archive . 1500+ files for training, 250 for validating and 300 for testing.

I trained for 5 epochs.

python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir deepspeech-0.6.1-checkpoint --epochs 5 --train_files …/deepspeech-0.6.1-models/audio/recordings_csv/train.csv --dev_files …/deepspeech-0.6.1-models/audio/recordings_csv/validate.csv --test_files …/deepspeech-0.6.1-models/audio/recordings_csv/test.csv --learning_rate 0.0001 --use_cudnn_rnn true --use_allow_growth true --lm_binary_path …/deepspeech-0.6.1-models/lm.binary --lm_trie_path …/deepspeech-0.6.1-models/trie --export_dir exported_model/

Training succeeded and “output_graph.pb” was saved in “exported_model”. When I run inference with it, accuracy is good on test.csv, but it fails to predict other audio.

Sorry for uploading a picture; that is how I saved the output earlier.

That’s a small dataset

Maybe you want to lower the learning rate a bit more given the small amount of data. How was the loss evolving between training and validation steps ?

Which other audio ?

Please share text output as proper console output with code formatting on discourse so we can actually use it.

@lissyx
> Which other audio ?

Earlier I tested some audio files using deepspeech-0.6.1-models / output_graph.pbmm, and the results were quite good on some of them. After fine-tuning on my data (as described above), it isn’t predicting like before. It gives good results only on audio files from the training corpus, and very high WER and loss on the audio files I had tried before with deepspeech-0.6.1-models / output_graph.pbmm.

Example: an audio file saying, “Your power is sufficient I said”.
Before training on my own data, running with deepspeech-0.6.1-models / output_graph.pbmm, the result was: “your power is sufficient i said”

After training on my data, the result on the same audio was: “the station”.

That would be consistent with a small dataset and too high a learning rate IMHO.
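
To answer the question of how loss evolves between training and validation, the per-epoch lines from the console output can be put side by side. Here is a minimal sketch; it assumes the line format that appears in the 0.6.1 log posted later in this thread, and `loss_table` is a made-up helper name. A validation loss that stays well above the training loss, or a gap that keeps growing, would be one hint of overfitting on a small dataset.

```python
import re

# Matches lines such as:
#   Epoch 0 |   Training | Elapsed Time: 0:02:01 | Steps: 361 | Loss: 15.122960
LINE = re.compile(r"Epoch (\d+) \|\s*(Training|Validation) \|.*?Loss: (\d+\.\d+)")


def loss_table(log_text):
    """Return {epoch: {"Training": loss, "Validation": loss}} from console output."""
    table = {}
    for epoch, phase, loss in LINE.findall(log_text):
        table.setdefault(int(epoch), {})[phase] = float(loss)
    return table


# Sample lines in the format of the log in this thread
# (dataset path shortened for the example).
sample = """\
Epoch 0 |   Training | Elapsed Time: 0:02:01 | Steps: 361 | Loss: 15.122960
Epoch 0 | Validation | Elapsed Time: 0:00:06 | Steps: 48 | Loss: 15.799511 | Dataset: validate.csv
Epoch 12 |   Training | Elapsed Time: 0:01:58 | Steps: 361 | Loss: 5.166077
Epoch 12 | Validation | Elapsed Time: 0:00:06 | Steps: 48 | Loss: 8.148501 | Dataset: validate.csv
"""

for epoch, row in sorted(loss_table(sample).items()):
    gap = row["Validation"] - row["Training"]
    print(f"epoch {epoch:2d}  train {row['Training']:.3f}  "
          f"dev {row['Validation']:.3f}  gap {gap:.3f}")
```

Pasting the full console output into `sample` gives one row per epoch, which is easier to share as text than a screenshot.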

Hi @lissyx
I tried what you suggested and I am now able to train the model.

python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir deepspeech-0.6.1-checkpoint --epochs 100 --train_files …/deepspeech-0.6.1-models/audio/fluent_speech/csv/train.csv --dev_files …/deepspeech-0.6.1-models/audio/fluent_speech/csv/validate.csv --test_files …/deepspeech-0.6.1-models/audio/fluent_speech/csv/test.csv --learning_rate 0.000001 --use_cudnn_rnn true --use_allow_growth true --lm_binary_path …/deepspeech-0.6.1-models/lm.binary --lm_trie_path …/deepspeech-0.6.1-models/trie --train_batch_size 64 --dev_batch_size 64 --test_batch_size 64

My training data is: https://www.fluent.ai/research/fluent-speech-commands/

The model was training smoothly, but after some epochs early stopping triggered and it returned an error without exporting the .pb file.

Error:

I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:02:01 | Steps: 361 | Loss: 15.122960                                                                                  
Epoch 0 | Validation | Elapsed Time: 0:00:06 | Steps: 48 | Loss: 15.799511 | Dataset: ../deepspeech-0.6.1-models/audio/fluent_speech/csv/validate.csv        
I Saved new best validating model with loss 15.799511 to: deepspeech-0.6.1-checkpoint/best_dev-234145
Epoch 1 |   Training | Elapsed Time: 0:01:58 | Steps: 361 | Loss: 10.945183                                                                                  
Epoch 1 | Validation | Elapsed Time: 0:00:06 | Steps: 48 | Loss: 13.657758 | Dataset: ../deepspeech-0.6.1-models/audio/fluent_speech/csv/validate.csv        
WARNING:tensorflow:From /home/glmr/anaconda3/envs/azeem_vir_env/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py:963: remove_checkpoint (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to delete files with this prefix.
W0214 11:39:42.411926 139768476886848 deprecation.py:323] From /home/glmr/anaconda3/envs/azeem_vir_env/lib/python3.6/site-packages/tensorflow_core/python/training/saver.py:963: remove_checkpoint (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to delete files with this prefix.
I Saved new best validating model with loss 13.657758 to: deepspeech-0.6.1-checkpoint/best_dev-234506
Epoch 2 |   Training | Elapsed Time: 0:01:58 | Steps: 361 | Loss: 9.513324                                                                                   
Epoch 2 | Validation | Elapsed Time: 0:00:06 | Steps: 48 | Loss: 12.472702 | Dataset: ../deepspeech-0.6.1-models/audio/fluent_speech/csv/validate.csv        
I Saved new best validating model with loss 12.472702 to: deepspeech-0.6.1-checkpoint/best_dev-234867
Epoch 3 |   Training | Elapsed Time: 0:01:58 | Steps: 361 | Loss: 8.588798                                                                                   
Epoch 3 | Validation | Elapsed Time: 0:00:06 | Steps: 48 | Loss: 11.621569 | Dataset: ../deepspeech-0.6.1-models/audio/fluent_speech/csv/validate.csv        
I Saved new best validating model with loss 11.621569 to: deepspeech-0.6.1-checkpoint/best_dev-235228
Epoch 4 |   Training | Elapsed Time: 0:01:58 | Steps: 361 | Loss: 7.921981                                                                                   
Epoch 4 | Validation | Elapsed Time: 0:00:06 | Steps: 48 | Loss: 10.949868 | Dataset: ../deepspeech-0.6.1-models/audio/fluent_speech/csv/validate.csv        
I Saved new best validating model with loss 10.949868 to: deepspeech-0.6.1-checkpoint/best_dev-235589
Epoch 5 |   Training | Elapsed Time: 0:01:58 | Steps: 361 | Loss: 7.381936                                                                                   
Epoch 5 | Validation | Elapsed Time: 0:00:06 | Steps: 48 | Loss: 10.450499 | Dataset: ../deepspeech-0.6.1-models/audio/fluent_speech/csv/validate.csv        
I Saved new best validating model with loss 10.450499 to: deepspeech-0.6.1-checkpoint/best_dev-235950
Epoch 6 |   Training | Elapsed Time: 0:01:58 | Steps: 361 | Loss: 6.925110                                                                                   
Epoch 6 | Validation | Elapsed Time: 0:00:06 | Steps: 48 | Loss: 9.995356 | Dataset: ../deepspeech-0.6.1-models/audio/fluent_speech/csv/validate.csv         
I Saved new best validating model with loss 9.995356 to: deepspeech-0.6.1-checkpoint/best_dev-236311
Epoch 7 |   Training | Elapsed Time: 0:01:58 | Steps: 361 | Loss: 6.541599                                                                                   
Epoch 7 | Validation | Elapsed Time: 0:00:06 | Steps: 48 | Loss: 9.585989 | Dataset: ../deepspeech-0.6.1-models/audio/fluent_speech/csv/validate.csv         
I Saved new best validating model with loss 9.585989 to: deepspeech-0.6.1-checkpoint/best_dev-236672
Epoch 8 |   Training | Elapsed Time: 0:01:57 | Steps: 361 | Loss: 6.200359                                                                                   
Epoch 8 | Validation | Elapsed Time: 0:00:06 | Steps: 48 | Loss: 9.234038 | Dataset: ../deepspeech-0.6.1-models/audio/fluent_speech/csv/validate.csv         
I Saved new best validating model with loss 9.234038 to: deepspeech-0.6.1-checkpoint/best_dev-237033
Epoch 9 |   Training | Elapsed Time: 0:01:57 | Steps: 361 | Loss: 5.890317                                                                                   
Epoch 9 | Validation | Elapsed Time: 0:00:06 | Steps: 48 | Loss: 8.920528 | Dataset: ../deepspeech-0.6.1-models/audio/fluent_speech/csv/validate.csv         
I Saved new best validating model with loss 8.920528 to: deepspeech-0.6.1-checkpoint/best_dev-237394
Epoch 10 |   Training | Elapsed Time: 0:01:57 | Steps: 361 | Loss: 5.630370                                                                                  
Epoch 10 | Validation | Elapsed Time: 0:00:06 | Steps: 48 | Loss: 8.636955 | Dataset: ../deepspeech-0.6.1-models/audio/fluent_speech/csv/validate.csv        
I Saved new best validating model with loss 8.636955 to: deepspeech-0.6.1-checkpoint/best_dev-237755
Epoch 11 |   Training | Elapsed Time: 0:01:58 | Steps: 361 | Loss: 5.385189                                                                                  
Epoch 11 | Validation | Elapsed Time: 0:00:06 | Steps: 48 | Loss: 8.378594 | Dataset: ../deepspeech-0.6.1-models/audio/fluent_speech/csv/validate.csv        
I Saved new best validating model with loss 8.378594 to: deepspeech-0.6.1-checkpoint/best_dev-238116
Epoch 12 |   Training | Elapsed Time: 0:01:58 | Steps: 361 | Loss: 5.166077                                                                                  
Epoch 12 | Validation | Elapsed Time: 0:00:06 | Steps: 48 | Loss: 8.148501 | Dataset: ../deepspeech-0.6.1-models/audio/fluent_speech/csv/validate.csv        
I Saved new best validating model with loss 8.148501 to: deepspeech-0.6.1-checkpoint/best_dev-238477
I Early stop triggered as (for last 4 steps) validation loss: 8.148501 with standard deviation: 0.221324 and mean: 8.645359
I FINISHED optimization in 0:27:50.338386
WARNING:tensorflow:From /home/glmr/anaconda3/envs/azeem_vir_env/lib/python3.6/site-packages/tensorflow_core/contrib/rnn/python/ops/lstm_ops.py:597: Layer.add_variable (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.add_weight` method instead.
W0214 12:03:12.865196 139768476886848 deprecation.py:323] From /home/glmr/anaconda3/envs/azeem_vir_env/lib/python3.6/site-packages/tensorflow_core/contrib/rnn/python/ops/lstm_ops.py:597: Layer.add_variable (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
Please use `layer.add_weight` method instead.
INFO:tensorflow:Restoring parameters from deepspeech-0.6.1-checkpoint/best_dev-238477
I0214 12:03:12.936584 139768476886848 saver.py:1284] Restoring parameters from deepspeech-0.6.1-checkpoint/best_dev-238477
I Restored variables from best validation checkpoint at deepspeech-0.6.1-checkpoint/best_dev-238477, step 238477
Testing model on ../deepspeech-0.6.1-models/audio/fluent_speech/csv/test.csv
Test epoch | Steps: 50 | Elapsed Time: 0:04:54
Traceback (most recent call last):
  File "/home/glmr/anaconda3/envs/azeem_vir_env/lib/python3.6/site-packages/ds_ctcdecoder/swigwrapper.py", line 581, in <lambda>
    __setattr__ = lambda self, name, value: _swig_setattr(self, OutputVectorVector, name, value)
KeyboardInterrupt

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "DeepSpeech.py", line 974, in <module>
    absl.app.run(main)
  File "/home/glmr/anaconda3/envs/azeem_vir_env/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/glmr/anaconda3/envs/azeem_vir_env/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 951, in main
    test()
  File "DeepSpeech.py", line 684, in test
    samples = evaluate(FLAGS.test_files.split(','), create_model, try_loading)
  File "/home/glmr/glmShare/AzeemData/DeepSpeech-0.6.1/evaluate.py", line 155, in evaluate
    samples.extend(run_test(init_op, dataset=csv))
  File "/home/glmr/glmShare/AzeemData/DeepSpeech-0.6.1/evaluate.py", line 122, in run_test
    cutoff_prob=FLAGS.cutoff_prob, cutoff_top_n=FLAGS.cutoff_top_n)
  File "/home/glmr/anaconda3/envs/azeem_vir_env/lib/python3.6/site-packages/ds_ctcdecoder/__init__.py", line 116, in ctc_beam_search_decoder_batch
    batch_beam_results = swigwrapper.ctc_beam_search_decoder_batch(probs_seq, seq_lengths, native_alphabet, beam_size, num_processes, cutoff_prob, cutoff_top_n, scorer)
SystemError: <built-in function ctc_beam_search_decoder_batch> returned a result with an error set
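
Separately from the crash, the early-stop line in the log above can be checked by hand, which may explain why training stopped even though validation loss was still going down. The sketch below assumes the check compares the latest validation loss with the mean and population standard deviation of the three losses before it. That assumption reproduces the mean printed in the log exactly and the standard deviation to within rounding of the printed losses. The two threshold values of 0.5 are assumptions for illustration and do not appear anywhere in this thread.

```python
from statistics import mean, pstdev

# Validation losses for epochs 9 to 12, copied from the log above.
dev_losses = [8.920528, 8.636955, 8.378594, 8.148501]

# Assumed thresholds (hypothetical values, not shown in the log).
MEAN_THRESHOLD = 0.5
STD_THRESHOLD = 0.5

previous, latest = dev_losses[:-1], dev_losses[-1]
mean_loss = mean(previous)
std_loss = pstdev(previous)  # population standard deviation

# Stop if the loss got worse, or if it has flattened out: the latest value is
# close to the recent mean and the recent values vary little.
got_worse = latest > max(previous)
plateau = abs(latest - mean_loss) < MEAN_THRESHOLD and std_loss < STD_THRESHOLD

print(f"mean {mean_loss:.6f}  std {std_loss:.6f}")
print("early stop:", got_worse or plateau)
```

Under these assumptions the stop comes from the plateau condition rather than from the loss getting worse: the latest loss sits just under 0.5 away from the recent mean.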

No idea why … Can you verify your ds_ctcdecoder package is the current one ? Can you reproduce without anaconda, using a plain python virtualenv instead ?

> Can you verify your ds_ctcdecoder package is the current one ?

Yes Sure!

> Can you reproduce without anaconda?

I will try on another PC.
Thank you for the quick response.
If you find anything related to this error, please let me know.

@lissyx
Although I also tried using --noearly_stop, I got a different error.

I Restored variables from best validation checkpoint at deepspeech-0.6.1-checkpoint/best_dev-255083, step 255083
Testing model on ../deepspeech-0.6.1-models/audio/fluent_speech/csv/test.csv
Test epoch | Steps: 51 | Elapsed Time: 0:04:57
Traceback (most recent call last):
  File "/home/glmr/anaconda3/envs/azeem_vir_env/lib/python3.6/site-packages/ds_ctcdecoder/swigwrapper.py", line 581, in <lambda>
    __setattr__ = lambda self, name, value: _swig_setattr(self, OutputVectorVector, name, value)
KeyboardInterrupt

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "DeepSpeech.py", line 974, in <module>
    absl.app.run(main)
  File "/home/glmr/anaconda3/envs/azeem_vir_env/lib/python3.6/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/glmr/anaconda3/envs/azeem_vir_env/lib/python3.6/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "DeepSpeech.py", line 951, in main
    test()
  File "DeepSpeech.py", line 684, in test
    samples = evaluate(FLAGS.test_files.split(','), create_model, try_loading)
  File "/home/glmr/glmShare/AzeemData/DeepSpeech-0.6.1/evaluate.py", line 155, in evaluate
    samples.extend(run_test(init_op, dataset=csv))
  File "/home/glmr/glmShare/AzeemData/DeepSpeech-0.6.1/evaluate.py", line 122, in run_test
    cutoff_prob=FLAGS.cutoff_prob, cutoff_top_n=FLAGS.cutoff_top_n)
  File "/home/glmr/anaconda3/envs/azeem_vir_env/lib/python3.6/site-packages/ds_ctcdecoder/__init__.py", line 116, in ctc_beam_search_decoder_batch
    batch_beam_results = swigwrapper.ctc_beam_search_decoder_batch(probs_seq, seq_lengths, native_alphabet, beam_size, num_processes, cutoff_prob, cutoff_top_n, scorer)
SystemError: <built-in function ctc_beam_search_decoder_batch> returned a result with an error set

I think there is a problem in the ctc_decoder file.
What do you think?

That’s what the error says … Whether it fails because of a bug in our code or because of something on your system is still undecided. Please test what I asked for.

Hello @lissyx

python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir checkpoints/vlsi2 --epochs 40 --train_files …/deepspeech-0.6.1-models/audio/vlsi/train.csv --dev_files …/deepspeech-0.6.1-models/audio/vlsi/validate.csv --test_files …/deepspeech-0.6.1-models/audio/vlsi/test.csv --learning_rate 0.000001 --use_cudnn_rnn true --use_allow_growth true --lm_binary_path …/deepspeech-0.6.1-models/lm.binary --lm_trie_path …/deepspeech-0.6.1-models/trie --noearly_stop --export_dir exported_model/vlsi2 --train_batch_size 128 --dev_batch_size 128 --test_batch_size 128

I have fine-tuned the model successfully, and it created the checkpoints as well. But after training and validation, the testing part failed with the error below:

^CTraceback (most recent call last):
  File "DeepSpeech.py", line 16, in <module>
    import tensorflow as tf
  File "/home/glmr/anaconda3/envs/azeem_vir_env/lib/python3.6/site-packages/tensorflow/__init__.py", line 99, in <module>
    from tensorflow_core import *
  File "/home/glmr/anaconda3/envs/azeem_vir_env/lib/python3.6/site-packages/tensorflow_core/__init__.py", line 36, in <module>
    from tensorflow._api.v1 import compat
  File "/home/glmr/anaconda3/envs/azeem_vir_env/lib/python3.6/site-packages/tensorflow_core/_api/v1/compat/__init__.py", line 23, in <module>
    from tensorflow._api.v1.compat import v1
  File "/home/glmr/anaconda3/envs/azeem_vir_env/lib/python3.6/site-packages/tensorflow_core/_api/v1/compat/v1/__init__.py", line 672, in <module>
    from tensorflow_estimator.python.estimator.api._v1 import estimator
  File "/home/glmr/anaconda3/envs/azeem_vir_env/lib/python3.6/site-packages/tensorflow_estimator/__init__.py", line 10, in <module>
    from tensorflow_estimator._api.v1 import estimator
  File "/home/glmr/anaconda3/envs/azeem_vir_env/lib/python3.6/site-packages/tensorflow_estimator/_api/v1/estimator/__init__.py", line 12, in <module>
    from tensorflow_estimator._api.v1.estimator import inputs
  File "/home/glmr/anaconda3/envs/azeem_vir_env/lib/python3.6/site-packages/tensorflow_estimator/_api/v1/estimator/inputs/__init__.py", line 10, in <module>
    from tensorflow_estimator.python.estimator.inputs.numpy_io import numpy_input_fn
  File "/home/glmr/anaconda3/envs/azeem_vir_env/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/inputs/numpy_io.py", line 26, in <module>
    from tensorflow_estimator.python.estimator.inputs.queues import feeding_functions
  File "/home/glmr/anaconda3/envs/azeem_vir_env/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/inputs/queues/feeding_functions.py", line 40, in <module>
    import pandas as pd
  File "/home/glmr/anaconda3/envs/azeem_vir_env/lib/python3.6/site-packages/pandas/__init__.py", line 147, in <module>
    from pandas.io.api import (
  File "/home/glmr/anaconda3/envs/azeem_vir_env/lib/python3.6/site-packages/pandas/io/api.py", line 11, in <module>
    from pandas.io.html import read_html
  File "<frozen importlib._bootstrap>", line 971, in _find_and_load
  File "<frozen importlib._bootstrap>", line 955, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 665, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 674, in exec_module
  File "<frozen importlib._bootstrap_external>", line 764, in get_code
KeyboardInterrupt

I have just one constraint with my GPU: if it is not used for five minutes, the running process is terminated.

What’s that ?

I don’t see how we can help there.

Hi @lissyx

I think evaluate.py is not using the GPU, so the running process gets interrupted automatically. That’s why it shows “KeyboardInterrupt”.

But I found a solution: I ran the test separately using evaluate.py, and while fine-tuning I no longer pass "--test_batch_size". This worked for me.