TensorBoard graphs don't match the training output

Hi,

I usually store the training output in a text file. When I compare that output with the TensorBoard graphs, they don't match.

TensorBoard is showing graphs that are inconsistent with the training output.

Here is my training output:

Loading best validating checkpoint from /home/bala/Speech_Recognition/Training_Pappa/v1-Jan5_No_Augment/deepspeech-0.9.2-checkpoint/best_dev-1564701
learning_rate : 0.000005

Epoch 0 | Training | Elapsed Time: 0:07:48 | Steps: 963 | Loss: 33.665629
Epoch 0 | Validation | Elapsed Time: 0:01:14 | Steps: 482 | Loss: 32.672577 | Dataset: /home/bala/Speech_Recognition/Training_Data_AWS/Dec-30_v1_Files/awsgroupdev.csv
I Saved new best validating model with loss 32.672577 to: /home/bala/Speech_Recognition/Training_Pappa/v1-Jan5_No_Augment/deepspeech-0.9.2-checkpoint/best_dev-1565664

Epoch 1 | Training | Elapsed Time: 0:07:59 | Steps: 963 | Loss: 33.600953
Epoch 1 | Validation | Elapsed Time: 0:01:13 | Steps: 482 | Loss: 32.527303 |

Epoch 2 | Training | Elapsed Time: 0:07:58 | Steps: 963 | Loss: 33.350906
Epoch 2 | Validation | Elapsed Time: 0:01:12 | Steps: 482 | Loss: 32.453275 |

Epoch 3 | Training | Elapsed Time: 0:07:58 | Steps: 963 | Loss: 33.195642
Epoch 3 | Validation | Elapsed Time: 0:01:13 | Steps: 482 | Loss: 32.487144 |

Epoch 4 | Training | Elapsed Time: 0:07:58 | Steps: 963 | Loss: 33.144140
Epoch 4 | Validation | Elapsed Time: 0:01:12 | Steps: 482 | Loss: 32.416610

Epoch 5 | Training | Elapsed Time: 0:07:59 | Steps: 963 | Loss: 32.796107
Epoch 5 | Validation | Elapsed Time: 0:01:13 | Steps: 482 | Loss: 32.416092

Epoch 6 | Training | Elapsed Time: 0:08:00 | Steps: 963 | Loss: 32.730357
Epoch 6 | Validation | Elapsed Time: 0:01:12 | Steps: 482 | Loss: 32.320890

Epoch 7 | Training | Elapsed Time: 0:08:00 | Steps: 963 | Loss: 32.626243
Epoch 7 | Validation | Elapsed Time: 0:01:13 | Steps: 482 | Loss: 32.322866

Epoch 8 | Training | Elapsed Time: 0:08:00 | Steps: 963 | Loss: 32.543887
Epoch 8 | Validation | Elapsed Time: 0:01:12 | Steps: 482 | Loss: 32.240156 |

Epoch 9 | Training | Elapsed Time: 0:07:59 | Steps: 963 | Loss: 32.429364
Epoch 9 | Validation | Elapsed Time: 0:01:13 | Steps: 482 | Loss: 32.168196

Epoch 10 | Training | Elapsed Time: 0:07:58 | Steps: 963 | Loss: 32.202116
Epoch 10 | Validation | Elapsed Time: 0:01:12 | Steps: 482 | Loss: 32.198536

.
.
.
Epoch 44 | Training | Elapsed Time: 0:08:06 | Steps: 963 | Loss: 28.093275
Epoch 44 | Validation | Elapsed Time: 0:01:13 | Steps: 482 | Loss: 31.266112

As you can see, my training and validation losses keep decreasing, but the TensorBoard graph shows something different.

Could you please help me solve this issue? @lissyx @reuben

Yeah, no reply within 10h, it’s really welcome to ping people, not rude at all.

You’re looking at the step loss; I’m not sure what is wrong here.
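To make the distinction concrete: the console prints a running average over all steps in the epoch, while a step-level TensorBoard curve plots each batch's loss individually, so it swings around even when the epoch averages decrease steadily. A minimal illustration with simulated numbers (not DeepSpeech code):

```python
# Illustrative sketch: per-step (per-batch) losses are noisy, while the
# per-epoch average -- the figure the console log prints -- is smooth.
import random

random.seed(0)

def epoch_average(step_losses):
    """Average the per-batch losses, as an epoch summary line does."""
    return sum(step_losses) / len(step_losses)

# Simulate three epochs of noisy step losses around a slowly decreasing mean,
# roughly matching the scale of the log above (963 steps, loss near 33).
epoch_means = []
for epoch in range(3):
    base = 33.0 - 0.3 * epoch
    steps = [base + random.uniform(-3, 3) for _ in range(963)]
    epoch_means.append(epoch_average(steps))

# Individual step values swing by several units, yet the epoch averages
# decrease steadily -- both views are computed from the same numbers.
print(epoch_means)
```

So a jagged step-loss graph and a cleanly decreasing epoch log are not contradictory; they are two views of the same training run.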

Sorry, there was no response to the post, so I tagged you.

Can I use the step loss to identify whether the model is overfitting? Some deep learning models overfit when trained on small amounts of data.

Why does DeepSpeech training run for up to 200 epochs before exporting the model? If we detect overfitting or no improvement in the validation loss, we could stop the training early, like the early_stop mechanism in DeepSpeech.
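For what it's worth, DeepSpeech's training code does ship an early-stop mechanism; if I recall the 0.9.x flag names correctly, they are `--early_stop`, `--es_epochs`, and `--es_min_delta` (run `python DeepSpeech.py --helpfull` to confirm against your version). The underlying patience logic is roughly this sketch:

```python
# Sketch of patience-based early stopping on the validation loss -- the same
# idea as DeepSpeech's early-stop flags (flag names from memory of 0.9.x;
# this is not the actual DeepSpeech implementation).

def should_stop(val_losses, patience=5, min_delta=0.05):
    """Stop when the last `patience` epochs show no improvement of at
    least `min_delta` over the best validation loss before them."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best > best_before - min_delta

# A plateauing validation curve triggers the stop...
plateau = [32.6, 32.5, 32.45, 32.44, 32.44, 32.45, 32.44, 32.46, 32.45]
print(should_stop(plateau))    # True

# ...while a still-improving one does not.
improving = [32.6, 32.3, 32.0, 31.7, 31.4, 31.1, 30.8]
print(should_stop(improving))  # False
```

With losses like yours, which are still decreasing slowly at epoch 44, a patience-based stop would keep training rather than cut it short.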

10h, seriously. Are we on call for you? I don’t think so.

This is not the question you asked at first; you mentioned TensorBoard is showing irrelevant output, which is not true, looking at your text output and the graph …

Do you understand what the loss is? It’s its evolution that matters. Yours does not show anything interesting, but since you did not care to share your training setup, there’s nothing we can do.

I think it's 9 hours … okay, leave it.

My summary report exported from the DeepSpeech training, together with the training output, is attached as a zip.

My training parameters are:

python DeepSpeech.py \
  --n_hidden 2048 \
  --checkpoint_dir /home/bala/Speech_Recognition/Training_Pappa/v1-Jan5_No_Augment/deepspeech-0.9.2-checkpoint \
  --epochs 150 \
  --train_files /home/bala/Speech_Recognition/Training_Data_AWS/Dec-30_v1_Files/awsgrouptrain.csv \
  --dev_files /home/bala/Speech_Recognition/Training_Data_AWS/Dec-30_v1_Files/awsgroupdev.csv \
  --test_files /home/bala/Speech_Recognition/Training_Data_AWS/Dec-30_v1_Files/awsgrouptest.csv \
  --export_dir /home/bala/Speech_Recognition/Training_Pappa/v1-Jan5_No_Augment/learningrate_000005/export_directory \
  --learning_rate 0.000005 \
  --train_batch_size 16 \
  --dev_batch_size 8 \
  --test_batch_size 8 \
  --dropout_rate 0.40 \
  --export_file_name no_aug-1.0.0 \
  --export_model_name no_aug \
  --export_model_version 1.0.0 \
  --export_author_id no_augmentation \
  --scorer /home/bala/Speech_Recognition/External_Scorer/AWS_v1_Jan4/stt-1.0.0.scorer \
  --scorer_path /home/bala/Speech_Recognition/External_Scorer/AWS_v1_Jan4/stt-1.0.0.scorer \
  --summary_dir /home/bala/Speech_Recognition/Training_Pappa/v1-Jan5_No_Augment/learningrate_000005/summary \
  --train_cudnn > /home/bala/Speech_Recognition/Training_Pappa/v1-Jan5_No_Augment/learningrate_000005/Trainingoutput.txt

TensorBoard command:

tensorboard \
  --logdir /home/bala/Speech_Recognition/Training_Pappa/v1-Jan5_No_Augment/learningrate_000005/summary \
  --port 9000 \
  --host 0.0.0.0

I hope the given info is sufficient.
Training_Output.zip (1.5 MB)