How to monitor training/validation loss during training

I am trying to train my own model. I read the documentation, but I could not find a way to monitor the training and validation loss per epoch (or per iteration), either during or after training. I know I can view some of this information in the terminal output, like:

I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:53:50 | Steps: 4808 | Loss: 9.299799                                                                                                         
Epoch 0 | Validation | Elapsed Time: 0:00:33 | Steps: 83 | Loss: 23.219886 

but I connect to a remote server and sometimes lose the output when the connection drops.

Use a terminal multiplexer such as tmux: it keeps your session (and its scrollback) alive, so you can re-attach to the terminal after disconnecting. Make sure you configure it for unlimited scrollback if you’re going to rely on it. You can also pipe the output through something like tee to save it to a log file while it is still printed to the terminal. Finally, you can use the --summary_dir flag to save TensorBoard summaries of the training and validation loss graphs; see the sketch below.
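To make that concrete, here is a rough sketch of the combined workflow. Only --summary_dir is the flag discussed above; the entry point name (DeepSpeech.py), the --train_files/--dev_files/--checkpoint_dir flags, and all paths, ports, and session names are assumptions from my own setup and may differ in your version.

```bash
# Start (or re-attach to) a named tmux session so the run survives a dropped SSH connection
tmux new-session -A -s deepspeech-train

# Run training; tee duplicates everything to a log file while still printing it to the
# terminal, and --summary_dir tells the trainer where to write TensorBoard summaries
python3 DeepSpeech.py \
    --train_files data/train.csv \
    --dev_files data/dev.csv \
    --checkpoint_dir checkpoints/ \
    --summary_dir summaries/ \
    2>&1 | tee -a training.log

# Later (or from another terminal): point TensorBoard at the summary directory
# and open http://localhost:6006 to watch the loss curves
tensorboard --logdir summaries/
```

With that in place you can re-attach at any time with `tmux attach -t deepspeech-train`, and the losses are also preserved on disk in both training.log and the TensorBoard event files, so they stay available after training finishes.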

@reuben Great suggestions. I used tmux and tee, but I was looking for something like the --summary_dir flag. Thanks!