I started training from scratch, but it gives an error while exporting the model.

I used this:

./DeepSpeech.py --train_files my-train.csv --dev_files my-dev.csv  --epochs 3  --save_checkpoint_dir ../checkpoint/ --train_cudnn true --export_dir ../checkpoint/ --alphabet_config_path /home/dimanshu/alpha.txt

The output is:

I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:01:04 | Steps: 410 | Loss: 114.940936                                                                                                                                             
Epoch 0 | Validation | Elapsed Time: 0:00:00 | Steps: 10 | Loss: 152.060320 | Dataset: my-dev.csv                                                                                                                        
I Saved new best validating model with loss 152.060320 to: ../checkpoint/best_dev-410
Epoch 1 |   Training | Elapsed Time: 0:01:00 | Steps: 410 | Loss: 111.319498                                                                                                                                             
Epoch 1 | Validation | Elapsed Time: 0:00:00 | Steps: 10 | Loss: 144.299709 | Dataset: my-dev.csv                                                                                                                        
I Saved new best validating model with loss 144.299709 to: ../checkpoint/best_dev-820
Epoch 2 |   Training | Elapsed Time: 0:01:00 | Steps: 410 | Loss: 111.206717
Epoch 2 | Validation | Elapsed Time: 0:00:00 | Steps: 10 | Loss: 141.602309 | Dataset: my-dev.csv                                                                                                                        
I Saved new best validating model with loss 141.602309 to: ../checkpoint/best_dev-1230
I FINISHED optimization in 0:03:18.015672
I Exporting the model...
I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
E All initialization methods failed (['best', 'last']).

When loading checkpoints, the code respects the --load_checkpoint_dir flag. When saving, it respects the --save_checkpoint_dir flag. You should be able to run again with --load_checkpoint_dir and the export flags, and it’ll pick up the checkpoint saved during training.
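For example, an export-only follow-up run might look like this (a sketch based on the paths in your command; ../export/ is just an illustrative separate export directory, and the alphabet must be the same one used for training):

./DeepSpeech.py --load_checkpoint_dir ../checkpoint/ --export_dir ../export/ --alphabet_config_path /home/dimanshu/alpha.txt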

Thanks @reuben, you solved my problem.

There is one more problem, on a different issue:

Can you please look into this?

Hey @reuben,
I trained on the existing model:

./DeepSpeech.py --n_hidden 2048 --save_checkpoint_dir /home/dimanshu/latestcheckpoiint/checkpoint --load_checkpoint_dir /home/dimanshu/latestcheckpoiint/checkpoint --epochs 100 --train_files my-train.csv --dev_files my-dev.csv --test_files my-test.csv --learning_rate 0.0001 --train_cudnn true --alphabet_config_path /home/dimanshu/alpha.txt --export_dir /home/dimanshu/latestcheckpoiint/checkpoint



But after completing the training, when I check it with sample data, it shows no result:

--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 137.519836
 - wav: file:///home/dimanshu/mydatadeepspeech/youtube-course-1/final_sound/5c45ebc9-8e10-4079-9a03-0688fbc3b96c.wav
 - src: "every literals this called axiom now in"
 - res: ""


Loading model from file /home/dimanshu/latestcheckpoiint/checkpoint/output_graph.pb
TensorFlow: v1.14.0-21-ge77504a
DeepSpeech: v0.6.1-0-g3df20fe
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2020-04-23 07:58:18.297305: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 0.146s.
Running inference.
  
Inference took 4.137s for 2.490s audio file.

It does not show any result.

It looks like you have very few input files. How many hours of input do you use, and what do you want to do with the model?

training data = 20k files
dev = 4k
test = 1.5k

@lissyx suggested that I fine-tune my model:

python3 DeepSpeech.py --drop_source_layers 1 --alphabet_config_path /home/dimanshu/alpha.txt  --load_checkpoint_dir /home/dimanshu/latestcheckpoiint/checkpoint  --save_checkpoint_dir /home/dimanshu/latestcheckpoiint/checkpoint/ --train_files train.csv   --test_files test.csv --dev_files dev.csv --train_cudnn true --export_dir /home/dimanshu/best_path

One epoch takes 40 minutes to complete.

Epoch 0 |   Training | Elapsed Time: 0:38:34 | Steps: 18999 | Loss: 75.033748                                                                                                                                            
Epoch 0 | Validation | Elapsed Time: 0:04:26 | Steps: 4852 | Loss: 95.878292 | Dataset: dev.csv                                                                                                                          
I Saved new best validating model with loss 95.878292 to: /home/dimanshu/latestcheckpoiint/checkpoint/best_dev-404775
Epoch 1 |   Training | Elapsed Time: 0:38:28 | Steps: 18999 | Loss: 75.318631                                                                                                                                            
Epoch 1 | Validation | Elapsed Time: 0:04:26 | Steps: 4852 | Loss: 95.129001 | Dataset: dev.csv     

Done with 75 epochs:

WER: 1.000000, CER: 1.000000, loss: 194.206146
 - wav: file:///home/dimanshu/mydatadeepspeech/youtube-course-1/final_sound/d5b17000-a6d4-416d-8d97-5f4852536d8a.wav
 - src: "language C++ Java PHP Python JavaScript"
 - res: ""
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 183.460220
 - wav: file:///home/dimanshu/mydatadeepspeech/youtube-course-1/final_sound/78ec2f60-1272-40c2-8385-a14684178572.wav
 - src: "when I use Moe followed by an underscore"
 - res: ""

Why is the result still blank?

What GPU are you using?

What language are you training?

What is your alphabet file like and is it the same for training/testing?

Use train and dev batch sizes.

Train from scratch, not from a checkpoint.

And 20k files is typically not enough for a full language model. 200k is more like it.

GPU = Tesla T4, but now I'm using a V100
language = English
alphabet file = capital and small letters, special characters, and numbers
train and dev batch size = default
It would take too much time to train from scratch, so I'm training on top of the latest released checkpoint, v0.6.1.

After 75 epochs:
Loading model from file …/best_path/output_graph.pb
TensorFlow: v1.14.0-21-ge77504a
DeepSpeech: v0.6.1-0-g3df20fe
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2020-04-27 05:04:41.270571: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 1.61s.
Running inference.
n l sth js

expected output = am global to make sure that it just
output from the model = n l sth js

How much more data and training is required?

If you are running 20k files at 6 secs each, it should take 5 mins or so per epoch. So increase the batch size and that should work.

Judging from the output, you are not using the GPU? Either way, check that you are, and that you get about 5 min per epoch with a batch size of 64 or so.
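For reference, the batch sizes are set with dedicated flags; a sketch based on your earlier command (tune 64 down if you run out of GPU memory):

./DeepSpeech.py --n_hidden 2048 --load_checkpoint_dir /home/dimanshu/latestcheckpoiint/checkpoint --save_checkpoint_dir /home/dimanshu/latestcheckpoiint/checkpoint --train_files train.csv --dev_files dev.csv --test_files test.csv --train_batch_size 64 --dev_batch_size 64 --test_batch_size 64 --learning_rate 0.0001 --train_cudnn true --alphabet_config_path /home/dimanshu/alpha.txt --export_dir /home/dimanshu/best_path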

Typically the alphabet is just letters, with no special characters and no numbers. Check num2words for converting the numbers.
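For example, the num2words Python package (an assumption here; install it separately) can expand digits into words when you prepare the transcripts:

pip3 install num2words
python3 -c "from num2words import num2words; print(num2words(42))"
# prints: forty-two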

Maybe check what this repo does and you’ll get decent results:

Hi @othiele, I changed the batch size to 64 and now every epoch takes 1 minute.
I started training for 800 epochs, and around epoch 560 the loss became constant:
Epoch 561 | Training | Elapsed Time: 0:01:00 | Steps: 296 | Loss: 48.228566
Epoch 561 | Validation | Elapsed Time: 0:00:06 | Steps: 75 | Loss: 53.046018 | Dataset: dev.csv
Epoch 562 | Training | Elapsed Time: 0:01:01 | Steps: 296 | Loss: 48.011230
Epoch 562 | Validation | Elapsed Time: 0:00:06 | Steps: 75 | Loss: 53.025083 | Dataset: dev.csv
Epoch 563 | Training | Elapsed Time: 0:01:01 | Steps: 296 | Loss: 48.257540
Epoch 563 | Validation | Elapsed Time: 0:00:06 | Steps: 75 | Loss: 52.856465 | Dataset: dev.csv

I used this:
./DeepSpeech.py --n_hidden 2048 --save_checkpoint_dir /home/dimanshu/latestcheckpoiint/checkpoint --load_checkpoint_dir /home/dimanshu/latestcheckpoiint/checkpoint --epochs 5 --train_files train.csv --test_files test.csv --dev_files dev.csv --learning_rate 0.0001 --train_cudnn true --alphabet_config_path /home/dimanshu/alpha.txt --export_dir /home/dimanshu/best_path/ --train_batch_size 64 --test_batch_size 64 --dev_batch_size 64

  1. How do I reduce the loss further?
  2. What parameters should I change, the dropout layers or anything else?
  3. How do I see the WER after every epoch?

You don’t want that.

Your data, your knowledge, we can’t teach you.

Depends on how your network learns. Again, your data, your training, your knowledge.

Adding to @lissyx: try a dropout of 0.4, but as I said, you may need about 200k files to get a better WER. Your results are OK for that amount of data if the language is quite diverse.
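In command form, that means adding --dropout_rate 0.4 to the training invocation (assuming --dropout_rate is the flag name in your DeepSpeech version; check the training flags documentation), for example:

./DeepSpeech.py --dropout_rate 0.4 --learning_rate 0.0001 --train_batch_size 64 --dev_batch_size 64 --test_batch_size 64 --train_files train.csv --dev_files dev.csv --test_files test.csv --train_cudnn true --alphabet_config_path /home/dimanshu/alpha.txt --load_checkpoint_dir /home/dimanshu/latestcheckpoiint/checkpoint --save_checkpoint_dir /home/dimanshu/latestcheckpoiint/checkpoint --export_dir /home/dimanshu/best_path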

@othiele, I have some questions.
I have a dataset of 80k files.

When I started training, it showed this result:

Epoch 0 | Training | Elapsed Time: 0:03:43 | Steps: 1049 | Loss: 30.588450
Epoch 0 | Validation | Elapsed Time: 0:00:15 | Steps: 187 | Loss: 28.513289 | Dataset: dev.csv
I Saved new best validating model with loss 28.513289 to: /home/dimanshu/latestcheckpoiint/new/best_dev-234833
Epoch 1 | Training | Elapsed Time: 0:03:34 | Steps: 1049 | Loss: 20.601734
Epoch 1 | Validation | Elapsed Time: 0:00:15 | Steps: 187 | Loss: 27.018155 | Dataset: dev.csv
I Saved new best validating model with loss 27.018155 to: /home/dimanshu/latestcheckpoiint/new/best_dev-235882
Epoch 2 | Training | Elapsed Time: 0:03:33 | Steps: 1049 | Loss: 17.995548
Epoch 2 | Validation | Elapsed Time: 0:00:15 | Steps: 187 | Loss: 26.553436 | Dataset: dev.csv
I Saved new best validating model with loss 26.553436 to: /home/dimanshu/latestcheckpoiint/new/best_dev-236931
Epoch 3 | Training | Elapsed Time: 0:03:33 | Steps: 1049 | Loss: 16.707786
Epoch 3 | Validation | Elapsed Time: 0:00:15 | Steps: 187 | Loss: 25.606835 | Dataset: dev.csv
I Saved new best validating model with loss 25.606835 to: /home/dimanshu/latestcheckpoiint/new/best_dev-237980
Epoch 4 | Training | Elapsed Time: 0:03:38 | Steps: 1049 | Loss: 15.702921
Epoch 4 | Validation | Elapsed Time: 0:00:15 | Steps: 187 | Loss: 24.864427 | Dataset: dev.csv
I Saved new best validating model with loss 24.864427 to: /home/dimanshu/latestcheckpoiint/new/best_dev-239029
Epoch 5 | Training | Elapsed Time: 0:03:38 | Steps: 1049 | Loss: 14.961168
Epoch 5 | Validation | Elapsed Time: 0:00:15 | Steps: 187 | Loss: 25.519202 | Dataset: dev.csv
Epoch 6 | Training | Elapsed Time: 0:03:37 | Steps: 1049 | Loss: 15.021698
Epoch 6 | Validation | Elapsed Time: 0:00:15 | Steps: 187 | Loss: 25.594796 | Dataset: dev.csv
Epoch 7 | Training | Elapsed Time: 0:03:38 | Steps: 1049 | Loss: 14.374933
Epoch 7 | Validation | Elapsed Time: 0:00:15 | Steps: 187 | Loss: 25.584334 | Dataset: dev.csv
Epoch 8 | Training | Elapsed Time: 0:03:37 | Steps: 1049 | Loss: 14.407759
Epoch 8 | Validation | Elapsed Time: 0:00:15 | Steps: 187 | Loss: 25.351115 | Dataset: dev.csv
Epoch 9 | Training | Elapsed Time: 0:03:38 | Steps: 1049 | Loss: 14.583124
Epoch 9 | Validation | Elapsed Time: 0:00:15 | Steps: 187 | Loss: 26.003742 | Dataset: dev.csv
Epoch 10 | Training | Elapsed Time: 0:03:37 | Steps: 1049 | Loss: 15.178494
Epoch 10 | Validation | Elapsed Time: 0:00:15 | Steps: 187 | Loss: 25.597253 | Dataset: dev.csv
Epoch 11 | Training | Elapsed Time: 0:03:38 | Steps: 1049 | Loss: 15.777427
Epoch 11 | Validation | Elapsed Time: 0:00:16 | Steps: 187 | Loss: 27.344462 | Dataset: dev.csv
Epoch 12 | Training | Elapsed Time: 0:03:37 | Steps: 1049 | Loss: 18.312971
Epoch 12 | Validation | Elapsed Time: 0:00:15 | Steps: 187 | Loss: 33.670939 | Dataset: dev.csv
Epoch 13 | Training | Elapsed Time: 0:03:37 | Steps: 1049 | Loss: 27.368952
Epoch 13 | Validation | Elapsed Time: 0:00:15 | Steps: 187 | Loss: 94.934918 | Dataset: dev.csv
Epoch 14 | Training | Elapsed Time: 0:03:37 | Steps: 1049 | Loss: 58.726450
Epoch 14 | Validation | Elapsed Time: 0:00:15 | Steps: 187 | Loss: 68.836654 | Dataset: dev.csv
Epoch 15 | Training | Elapsed Time: 0:03:37 | Steps: 1049 | Loss: 49.482339
Epoch 15 | Validation | Elapsed Time: 0:00:15 | Steps: 187 | Loss: 52.932693 | Dataset: dev.csv
Epoch 16 | Training | Elapsed Time: 0:03:37 | Steps: 1049 | Loss: 47.158406
Epoch 16 | Validation | Elapsed Time: 0:00:15 | Steps: 187 | Loss: 58.368865 | Dataset: dev.csv
Epoch 17 | Training | Elapsed Time: 0:03:37 | Steps: 1049 | Loss: 48.946441
Epoch 17 | Validation | Elapsed Time: 0:00:15 | Steps: 187 | Loss: 61.969229 | Dataset: dev.csv
Epoch 18 | Training | Elapsed Time: 0:03:37 | Steps: 1049 | Loss: 47.054758
Epoch 18 | Validation | Elapsed Time: 0:00:15 | Steps: 187 | Loss: 107.120515 | Dataset: dev.csv
Epoch 19 | Training | Elapsed Time: 0:03:37 | Steps: 1049 | Loss: 62.782858
Epoch 19 | Validation | Elapsed Time: 0:00:15 | Steps: 187 | Loss: 59.996260 | Dataset: dev.csv
Epoch 20 | Training | Elapsed Time: 0:03:36 | Steps: 1049 | Loss: 49.313696
Epoch 20 | Validation | Elapsed Time: 0:00:15 | Steps: 187 | Loss: 55.930255 | Dataset: dev.csv
Epoch 21 | Training | Elapsed Time: 0:03:37 | Steps: 1049 | Loss: 50.548152
Epoch 21 | Validation | Elapsed Time: 0:00:15 | Steps: 187 | Loss: 61.734633 | Dataset: dev.csv
Epoch 22 | Training | Elapsed Time: 0:03:36 | Steps: 1049 | Loss: 49.997154
Epoch 22 | Validation | Elapsed Time: 0:00:15 | Steps: 187 | Loss: 83.610665 | Dataset: dev.csv
Epoch 23 | Training | Elapsed Time: 0:03:37 | Steps: 1049 | Loss: 70.947716
Epoch 23 | Validation | Elapsed Time: 0:00:15 | Steps: 187 | Loss: 65.014022 | Dataset: dev.csv
Epoch 24 | Training | Elapsed Time: 0:03:37 | Steps: 1049 | Loss: 53.777456
Epoch 24 | Validation | Elapsed Time: 0:00:15 | Steps: 187 | Loss: 53.490376 | Dataset: dev.csv
Epoch 25 | Training | Elapsed Time: 0:03:37 | Steps: 1049 | Loss: 46.280939
Epoch 25 | Validation | Elapsed Time: 0:00:15 | Steps: 187 | Loss: 54.534843 | Dataset: dev.csv
Epoch 26 | Training | Elapsed Time: 0:03:37 | Steps: 1049 | Loss: 47.923410

  1. Why are the training and validation losses increasing?

  2. After this completes, if I start the training again it will resume from the best checkpoint, which is from the 4th epoch, so is everything after the 4th epoch pointless?

  3. My dataset consists of only letters and numbers:
    http://34.83.214.234/show/meta-train.csv

  4. How should I train this so that the validation loss decreases? Am I doing something wrong?
    Also, I can't see the WER.

./DeepSpeech.py --n_hidden 2048 --save_checkpoint_dir /home/dimanshu/latestcheckpoiint/load --load_checkpoint_dir /home/dimanshu/latestcheckpoiint/load --epochs 100 --train_files train.csv --test_files test.csv --dev_files dev.csv --learning_rate 0.0001 --train_cudnn true --alphabet_config_path /home/dimanshu/alpha.txt --export_dir /home/dimanshu/best_path --train_batch_size 64 --dev_batch_size 64 --test_batch_size 64

After completing the training, the result on test.csv is empty:
WER: 1.000000
res: ""

  1. Because validation changes the hyperparameters.

  2. Training didn’t get any better, therefore 4th.

  3. Link is dead …

  4. Please read my comment above, I can’t help you if you don’t.

http://34.83.214.234/show/meta_train.csv

Yes, I will add more data to make it 200k.
1) So I have to fine-tune my model first and then start the training? And I will use dropout = 0.4 and learning rate = 0.0001.

Ah, you can read :) The dropout should get you a lot further.
