TensorBoard graphs don't match the training output

Hi,

I usually store the training output in a text file. When I compare that output with the TensorBoard graphs, they don't match.

TensorBoard is showing graphs that are inconsistent with the training output.

Here is my training output:

Loading best validating checkpoint from /home/bala/Speech_Recognition/Training_Pappa/v1-Jan5_No_Augment/deepspeech-0.9.2-checkpoint/best_dev-1564701
learning_rate : 0.000005

Epoch 0 | Training | Elapsed Time: 0:07:48 | Steps: 963 | Loss: 33.665629
Epoch 0 | Validation | Elapsed Time: 0:01:14 | Steps: 482 | Loss: 32.672577 | Dataset: /home/bala/Speech_Recognition/Training_Data_AWS/Dec-30_v1_Files/awsgroupdev.csv
I Saved new best validating model with loss 32.672577 to: /home/bala/Speech_Recognition/Training_Pappa/v1-Jan5_No_Augment/deepspeech-0.9.2-checkpoint/best_dev-1565664

Epoch 1 | Training | Elapsed Time: 0:07:59 | Steps: 963 | Loss: 33.600953
Epoch 1 | Validation | Elapsed Time: 0:01:13 | Steps: 482 | Loss: 32.527303 |

Epoch 2 | Training | Elapsed Time: 0:07:58 | Steps: 963 | Loss: 33.350906
Epoch 2 | Validation | Elapsed Time: 0:01:12 | Steps: 482 | Loss: 32.453275 |

Epoch 3 | Training | Elapsed Time: 0:07:58 | Steps: 963 | Loss: 33.195642
Epoch 3 | Validation | Elapsed Time: 0:01:13 | Steps: 482 | Loss: 32.487144 |

Epoch 4 | Training | Elapsed Time: 0:07:58 | Steps: 963 | Loss: 33.144140
Epoch 4 | Validation | Elapsed Time: 0:01:12 | Steps: 482 | Loss: 32.416610

Epoch 5 | Training | Elapsed Time: 0:07:59 | Steps: 963 | Loss: 32.796107
Epoch 5 | Validation | Elapsed Time: 0:01:13 | Steps: 482 | Loss: 32.416092

Epoch 6 | Training | Elapsed Time: 0:08:00 | Steps: 963 | Loss: 32.730357
Epoch 6 | Validation | Elapsed Time: 0:01:12 | Steps: 482 | Loss: 32.320890

Epoch 7 | Training | Elapsed Time: 0:08:00 | Steps: 963 | Loss: 32.626243
Epoch 7 | Validation | Elapsed Time: 0:01:13 | Steps: 482 | Loss: 32.322866

Epoch 8 | Training | Elapsed Time: 0:08:00 | Steps: 963 | Loss: 32.543887
Epoch 8 | Validation | Elapsed Time: 0:01:12 | Steps: 482 | Loss: 32.240156 |

Epoch 9 | Training | Elapsed Time: 0:07:59 | Steps: 963 | Loss: 32.429364
Epoch 9 | Validation | Elapsed Time: 0:01:13 | Steps: 482 | Loss: 32.168196

Epoch 10 | Training | Elapsed Time: 0:07:58 | Steps: 963 | Loss: 32.202116
Epoch 10 | Validation | Elapsed Time: 0:01:12 | Steps: 482 | Loss: 32.198536

.
.
.
Epoch 44 | Training | Elapsed Time: 0:08:06 | Steps: 963 | Loss: 28.093275
Epoch 44 | Validation | Elapsed Time: 0:01:13 | Steps: 482 | Loss: 31.266112

As you can see, my training and validation losses keep decreasing, but the TensorBoard graph shows something different.

Could you please help me solve this issue? @lissyx @reuben

Yeah, no reply within 10h, it’s really welcome to ping people, not rude at all.

You’re looking at the step loss; I’m not sure what is wrong here.
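To make the distinction concrete: the console prints a running average over all steps in the epoch, while a step-level TensorBoard curve plots each batch's loss individually, so it swings around even when the epoch averages decrease steadily. A minimal illustration with simulated numbers (not DeepSpeech code):

```python
# Illustrative sketch: per-step (per-batch) losses are noisy, while the
# per-epoch average -- the figure the console log prints -- is smooth.
import random

random.seed(0)

def epoch_average(step_losses):
    """Average the per-batch losses, as an epoch summary line does."""
    return sum(step_losses) / len(step_losses)

# Simulate three epochs of noisy step losses around a slowly decreasing mean,
# roughly matching the scale of the log above (963 steps, loss near 33).
epoch_means = []
for epoch in range(3):
    base = 33.0 - 0.3 * epoch
    steps = [base + random.uniform(-3, 3) for _ in range(963)]
    epoch_means.append(epoch_average(steps))

# Individual step values swing by several units, yet the epoch averages
# decrease steadily -- both views are computed from the same numbers.
print(epoch_means)
```

So a jagged step-loss graph and a cleanly decreasing epoch log are not contradictory; they are two views of the same training run.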

Sorry, there was no response to the post, so I tagged you.

Can I use the step loss to identify whether the model is overfitting? Some deep learning models overfit when trained on small amounts of data.

Why does DeepSpeech training run for up to 200 epochs before exporting the model? If we detect overfitting or no improvement in the validation loss, we could stop the training early, like the early_stop mechanism in DeepSpeech.
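For what it's worth, DeepSpeech's training code does ship an early-stop mechanism; if I recall the 0.9.x flag names correctly, they are `--early_stop`, `--es_epochs`, and `--es_min_delta` (run `python DeepSpeech.py --helpfull` to confirm against your version). The underlying patience logic is roughly this sketch:

```python
# Sketch of patience-based early stopping on the validation loss -- the same
# idea as DeepSpeech's early-stop flags (flag names from memory of 0.9.x;
# this is not the actual DeepSpeech implementation).

def should_stop(val_losses, patience=5, min_delta=0.05):
    """Stop when the last `patience` epochs show no improvement of at
    least `min_delta` over the best validation loss before them."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best > best_before - min_delta

# A plateauing validation curve triggers the stop...
plateau = [32.6, 32.5, 32.45, 32.44, 32.44, 32.45, 32.44, 32.46, 32.45]
print(should_stop(plateau))    # True

# ...while a still-improving one does not.
improving = [32.6, 32.3, 32.0, 31.7, 31.4, 31.1, 30.8]
print(should_stop(improving))  # False
```

With losses like yours, which are still decreasing slowly at epoch 44, a patience-based stop would keep training rather than cut it short.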

10h, seriously. Are we on call for you? I don’t think so.

This is not the question you asked at first; you mentioned TensorBoard is showing irrelevant output, which is not true, looking at your text output and the graph …

Do you understand what the loss is? It’s its evolution that matters. Yours does not show anything interesting, but since you did not care to share your training setup, there’s nothing we can do.

I think it's 9 hours … okay, leave it.

My summary report exported from the DeepSpeech training, together with the training output, is attached as a zip.

My training parameters are:

python DeepSpeech.py \
  --n_hidden 2048 \
  --checkpoint_dir /home/bala/Speech_Recognition/Training_Pappa/v1-Jan5_No_Augment/deepspeech-0.9.2-checkpoint \
  --epochs 150 \
  --train_files /home/bala/Speech_Recognition/Training_Data_AWS/Dec-30_v1_Files/awsgrouptrain.csv \
  --dev_files /home/bala/Speech_Recognition/Training_Data_AWS/Dec-30_v1_Files/awsgroupdev.csv \
  --test_files /home/bala/Speech_Recognition/Training_Data_AWS/Dec-30_v1_Files/awsgrouptest.csv \
  --export_dir /home/bala/Speech_Recognition/Training_Pappa/v1-Jan5_No_Augment/learningrate_000005/export_directory \
  --learning_rate 0.000005 \
  --train_batch_size 16 \
  --dev_batch_size 8 \
  --test_batch_size 8 \
  --dropout_rate 0.40 \
  --export_file_name no_aug-1.0.0 \
  --export_model_name no_aug \
  --export_model_version 1.0.0 \
  --export_author_id no_augmentation \
  --scorer /home/bala/Speech_Recognition/External_Scorer/AWS_v1_Jan4/stt-1.0.0.scorer \
  --scorer_path /home/bala/Speech_Recognition/External_Scorer/AWS_v1_Jan4/stt-1.0.0.scorer \
  --summary_dir /home/bala/Speech_Recognition/Training_Pappa/v1-Jan5_No_Augment/learningrate_000005/summary \
  --train_cudnn > /home/bala/Speech_Recognition/Training_Pappa/v1-Jan5_No_Augment/learningrate_000005/Trainingoutput.txt

TensorBoard command:

tensorboard \
  --logdir /home/bala/Speech_Recognition/Training_Pappa/v1-Jan5_No_Augment/learningrate_000005/summary \
  --port 9000 \
  --host 0.0.0.0

I hope the given info is sufficient.
Training_Output.zip (1.5 MB)