Noise injection training experiment

Hey @lissyx @othiele I have decided to use the VoxForge dataset through the importer in the bin folder.

I have 12 GB of RAM and a Tesla K80 12 GB GPU. Would the process be more time-efficient with the same GPU and 24 GB of RAM?

As I have limited resources but would still like to get some meaningful data without training for days, could you give me some advice on which parameters to run my training with?

My main aim is not an extremely low WER; I only need some correlation to show up between the test results of the different models. I am looking to train them for around 6 hours max, as I have limited time.

What values should I use for

--epochs
--train_batch_size
--dev_batch_size
--test_batch_size
--n_hidden
--learning_rate

I'm not sure how the dropout rate works; should I change that as well?
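For context, my current mental model of dropout (just my own toy sketch, not taken from the DeepSpeech code) is that each activation gets randomly zeroed during training with the given probability, roughly:

import numpy as np

def dropout(activations, rate, rng=None):
    # Toy illustration of inverted dropout: zero each activation with
    # probability `rate` during training and rescale the survivors so the
    # expected value stays the same.
    rng = rng or np.random.default_rng()
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

# e.g. with a dropout rate of 0.3, roughly 30% of the values are zeroed per step
print(dropout(np.ones((4, 5)), rate=0.3))

Please correct me if that picture is off.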

Also, about the add augmentation from the documentation - https://deepspeech.readthedocs.io/en/v0.8.2/TRAINING.html#augmentation
--augment add[p=,stddev=,domain=]

If I specify domain='spectrogram' then, if I understand correctly, random values will be added to the numeric representation of the audio?

Do you by chance have detailed documentation on how it works? If not, I will try looking in the code. Would I find details on this in the DeepSpeech.py script?
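To make my assumption explicit, this is the rough behaviour I am imagining for add[] in the spectrogram domain (my own sketch based only on the flag description, not on the actual DeepSpeech implementation):

import numpy as np

def add_augment(features, p=0.5, stddev=0.5, rng=None):
    # Assumed behaviour: with probability p, add zero-mean Gaussian noise with
    # standard deviation `stddev` to every value of the feature matrix.
    rng = rng or np.random.default_rng()
    if rng.random() >= p:
        return features  # leave this sample untouched
    noise = rng.normal(loc=0.0, scale=stddev, size=features.shape)
    return features + noise

# stand-in for one utterance's spectrogram features: (time_steps, n_features)
features = np.random.default_rng(0).normal(size=(500, 26))
print(add_augment(features, p=0.5, stddev=0.5).shape)  # shape unchanged, values noisier

If that is roughly right, then stddev directly sets how large the injected values are relative to the features.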

Thanks in advance!

Probably not, 12 GB should be fine for a K80; you should have more than 5 CPUs if possible.

6 hours is not much; you should aim for 10-15 epochs per model for somewhat OK results.

The higher the batch size the better, e.g. 8, 16, …

Use the default n_hidden.

Use the same dropout (e.g. 0.3 or 0.4) and learning rate (default 1e-3) for all models, as changing these alters results dramatically.

I don't know about augmentation; I wouldn't count on the documentation, read the code.

@othiele Thank you, the training worked. I ended up with a 94% WER though, but it finished optimising in only 3 hours, so I can still increase the epochs from 10 to 20 in the next run. I will also try to increase the batch size from 64 to 76 or 88, since it was only using 4 GB of GPU memory. I didn't do augmentation this time; I just wanted to see that it finishes properly. I will report back once I am done with the models or if I run into problems.

Also, could I train on another language, or would that introduce some interference into the data I would get from the 3 models because of the language complexity?

Would a smaller dataset in another language help me get better results in less training time?

Also, is there a chance that my model is underfitting, or do the validation and training losses look fine like this? These are the figures after 5 hours of training.

Thanks again.

I0903 17:28:48.316457 140227250132864 utils.py:141] NumExpr defaulting to 2 threads.
I Loading best validating checkpoint from /content/checks/best_dev-13240
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam_1
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/bias/Adam
I Loading variable from checkpoint: layer_6/bias/Adam_1
I Loading variable from checkpoint: layer_6/weights
I Loading variable from checkpoint: layer_6/weights/Adam
I Loading variable from checkpoint: layer_6/weights/Adam_1
I Loading variable from checkpoint: learning_rate
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:18:26 | Steps: 1324 | Loss: 81.477223
Epoch 0 | Validation | Elapsed Time: 0:00:05 | Steps: 10 | Loss: 81.157199 | Dataset: /content/voxforge/voxforge-dev.csv
I Saved new best validating model with loss 81.157199 to: /content/checks/best_dev-14564

Epoch 1 | Training | Elapsed Time: 0:17:59 | Steps: 1324 | Loss: 80.177082
Epoch 1 | Validation | Elapsed Time: 0:00:04 | Steps: 10 | Loss: 80.640202 | Dataset: /content/voxforge/voxforge-dev.csv
I Saved new best validating model with loss 80.640202 to: /content/checks/best_dev-15888

Epoch 2 | Training | Elapsed Time: 0:17:58 | Steps: 1324 | Loss: 79.045931
Epoch 2 | Validation | Elapsed Time: 0:00:04 | Steps: 10 | Loss: 79.040100 | Dataset: /content/voxforge/voxforge-dev.csv
I Saved new best validating model with loss 79.040100 to: /content/checks/best_dev-17212

Epoch 3 | Training | Elapsed Time: 0:17:53 | Steps: 1324 | Loss: 77.994055
Epoch 3 | Validation | Elapsed Time: 0:00:04 | Steps: 10 | Loss: 79.000660 | Dataset: /content/voxforge/voxforge-dev.csv
I Saved new best validating model with loss 79.000660 to: /content/checks/best_dev-18536

Epoch 4 | Training | Elapsed Time: 0:17:44 | Steps: 1324 | Loss: 77.084805
Epoch 4 | Validation | Elapsed Time: 0:00:04 | Steps: 10 | Loss: 78.440584 | Dataset: /content/voxforge/voxforge-dev.csv
I Saved new best validating model with loss 78.440584 to: /content/checks/best_dev-19860

Epoch 5 | Training | Elapsed Time: 0:17:45 | Steps: 1324 | Loss: 76.281919
Epoch 5 | Validation | Elapsed Time: 0:00:04 | Steps: 10 | Loss: 78.114109 | Dataset: /content/voxforge/voxforge-dev.csv
I Saved new best validating model with loss 78.114109 to: /content/checks/best_dev-21184

Epoch 6 | Training | Elapsed Time: 0:17:39 | Steps: 1324 | Loss: 75.526123
Epoch 6 | Validation | Elapsed Time: 0:00:04 | Steps: 10 | Loss: 77.518653 | Dataset: /content/voxforge/voxforge-dev.csv
I Saved new best validating model with loss 77.518653 to: /content/checks/best_dev-22508

Epoch 7 | Training | Elapsed Time: 0:17:49 | Steps: 1324 | Loss: 74.905342
Epoch 7 | Validation | Elapsed Time: 0:00:04 | Steps: 10 | Loss: 77.188713 | Dataset: /content/voxforge/voxforge-dev.csv
I Saved new best validating model with loss 77.188713 to: /content/checks/best_dev-23832

Epoch 8 | Training | Elapsed Time: 0:17:54 | Steps: 1324 | Loss: 74.230232
Epoch 8 | Validation | Elapsed Time: 0:00:04 | Steps: 10 | Loss: 76.931261 | Dataset: /content/voxforge/voxforge-dev.csv
I Saved new best validating model with loss 76.931261 to: /content/checks/best_dev-25156

Hey @lissyx @othiele, I tried training with the same training parameters but with augmentation, and I got the following error:

!python3 DeepSpeech.py --train_files /content/voxforge/voxforge-train.csv --test_files /content/voxforge/voxforge-test.csv --dev_files /content/voxforge/voxforge-dev.csv --epochs 15 --dev_batch_size 64 --train_batch_size 64 --test_batch_size 64 --log_dir /content/loggs --export_dir /content/models/ --train_cudnn True --checkpoint_dir /content/checks/ --alphabet_config_path /content/voxforge/alphabet.txt --export_model_name 'sept4,0.5,1,spec,noise1' --summary_dir /content/tensorsumm/ --augment add[p=0.5,stddev=0.5,domain='spectrogram']

I0904 09:25:33.918955 139703338915712 utils.py:141] NumExpr defaulting to 2 threads.
I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
I STARTING Optimization
Epoch 0 |   Training | Elapsed Time: 0:33:28 | Steps: 1313 | Loss: 148.881187   Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 518, 64, 2048] 
	 [[{{node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3}}]]
	 [[tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3/_69]]
  (1) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 518, 64, 2048] 
	 [[{{node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3}}]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/usr/local/lib/python3.6/dist-packages/deepspeech_training/train.py", line 961, in run_script
    absl.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/usr/local/lib/python3.6/dist-packages/deepspeech_training/train.py", line 933, in main
    train()
  File "/usr/local/lib/python3.6/dist-packages/deepspeech_training/train.py", line 601, in train
    train_loss, _ = run_set('train', epoch, train_init_op)
  File "/usr/local/lib/python3.6/dist-packages/deepspeech_training/train.py", line 566, in run_set
    feed_dict=feed_dict)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 518, 64, 2048] 
	 [[node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
	 [[tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3/_69]]
  (1) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 518, 64, 2048] 
	 [[node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3':
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/usr/local/lib/python3.6/dist-packages/deepspeech_training/train.py", line 961, in run_script
    absl.app.run(main)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "/usr/local/lib/python3.6/dist-packages/deepspeech_training/train.py", line 933, in main
    train()
  File "/usr/local/lib/python3.6/dist-packages/deepspeech_training/train.py", line 479, in train
    gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
  File "/usr/local/lib/python3.6/dist-packages/deepspeech_training/train.py", line 321, in get_tower_results
    gradients = optimizer.compute_gradients(avg_loss)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/optimizer.py", line 512, in compute_gradients
    colocate_gradients_with_ops=colocate_gradients_with_ops)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gradients_impl.py", line 158, in gradients
    unconnected_gradients)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gradients_util.py", line 679, in _GradientsHelper
    lambda: grad_fn(op, *out_grads))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gradients_util.py", line 350, in _MaybeCompile
    return grad_fn()  # Exit early
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gradients_util.py", line 679, in <lambda>
    lambda: grad_fn(op, *out_grads))
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/cudnn_rnn_grad.py", line 104, in _cudnn_rnn_backwardv3
    direction=op.get_attr("direction")) + (None,)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_cudnn_rnn_ops.py", line 749, in cudnn_rnn_backprop_v3
    name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

...which was originally created as op 'tower_0/cudnn_lstm/CudnnRNNV3', defined at:
  File "DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
[elided 4 identical lines from previous traceback]
  File "/usr/local/lib/python3.6/dist-packages/deepspeech_training/train.py", line 479, in train
    gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
  File "/usr/local/lib/python3.6/dist-packages/deepspeech_training/train.py", line 312, in get_tower_results
    avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
  File "/usr/local/lib/python3.6/dist-packages/deepspeech_training/train.py", line 239, in calculate_mean_edit_distance_and_loss
    logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl)
  File "/usr/local/lib/python3.6/dist-packages/deepspeech_training/train.py", line 190, in create_model
    output, output_state = rnn_impl(layer_3, seq_length, previous_state, reuse)
  File "/usr/local/lib/python3.6/dist-packages/deepspeech_training/train.py", line 128, in rnn_impl_cudnn_rnn
    sequence_lengths=seq_length)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/layers/base.py", line 548, in __call__
    outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 854, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 234, in wrapper
    return converted_call(f, options, args, kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 439, in converted_call
    return _call_unconverted(f, args, kwargs, options)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 330, in _call_unconverted
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 440, in call
    training)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 518, in _forward
    seed=self._seed)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1132, in _cudnn_rnn
    outputs, output_h, output_c, _, _ = gen_cudnn_rnn_ops.cudnn_rnnv3(**args)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_cudnn_rnn_ops.py", line 2051, in cudnn_rnnv3
    time_major=time_major, name=name)

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/util.py", line 262, in _run_finalizers
Process ForkPoolWorker-1:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
    with self._rlock:
  File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
    return self._semlock.__enter__()
KeyboardInterrupt
  File "/usr/lib/python3.6/multiprocessing/util.py", line 186, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 571, in _terminate_pool
    cls._help_stuff_finish(inqueue, task_handler, len(pool))
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 556, in _help_stuff_finish
    inqueue._rlock.acquire()
KeyboardInterrupt
Process ForkPoolWorker-2:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
  File "/usr/lib/python3.6/multiprocessing/queues.py", line 335, in get
    res = self._reader.recv_bytes()
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/usr/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
    chunk = read(handle, remaining)
KeyboardInterrupt

As you can see from my post above, I ran a training successfully with no --augment and --epochs 10. I don't know why I got this error; what can I change?

Thanks in advance!

Please search the GitHub issues; this is an upstream TensorFlow issue.

Hey @othiele @lissyx or anyone reading. It turned out to be a GPU RAM issue: once I decreased the batch size, the training went through. I also need to be careful with the augmentation probability, since with the VoxForge dataset I cannot handle half of the audio being augmented, so I decided to decrease the probability to around 0.03, 0.05 and 0.1.

My two trainings didn't work at all: in testing, one of them only returned the two letters 'e a' and the other only spaces. I assume it's because I made the stddev, i.e. the added noise, too big for the model to be able to learn anything.

On this note, I am having a hard time working out a decent value for stddev in the --augment add[] argument. Maybe you could help me understand how the standard deviation of the normal distribution works here, i.e. roughly what values would correspond to a little noise and what to a lot?
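To make the question more concrete, this is the kind of back-of-the-envelope check I have in mind (my own sketch with made-up unit-variance features, not the real DeepSpeech pipeline): compare the noise stddev to the spread of the features themselves to see what counts as a little or a lot of noise.

import numpy as np

rng = np.random.default_rng(0)

# stand-in for one utterance's feature matrix (time_steps, n_features);
# real spectrogram features will have a different scale
features = rng.normal(loc=0.0, scale=1.0, size=(500, 26))
feature_std = features.std()

for stddev in (0.05, 0.5, 5.0):
    noise = rng.normal(scale=stddev, size=features.shape)
    snr_db = 10 * np.log10((features ** 2).mean() / (noise ** 2).mean())
    print(f"stddev={stddev}: noise is {stddev / feature_std:.2f}x the feature spread, approx. SNR {snr_db:.1f} dB")

My guess is that a stddev well below the feature spread counts as "a little" noise and one comparable to or above it as "a lot", but I would appreciate confirmation.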

Furthermore, as I have now realised, training from scratch takes a lot of resources and is quite painful to experiment with in terms of time, which I don't have given my deadline for the experiment. Would it be a wise choice to use the pre-trained model for further training? Hypothetically speaking, would that give me visible results faster?

Thanks in advance, you are helping me immensely!

  1. You should get a regular training without augmentation going so you have a baseline.

  2. If you just get one-letter output, this could mean too little data or too few epochs.

  3. Transfer learning might be a good idea, as we said before; you'll need to experiment a bit and this might take some time.

@othiele Hey, I trained from the release v0.8.2 checkpoints for 10 epochs with the learning rate reduced on plateau, and got my loss down to around 7.345 with a WER of 26%. Is that enough to start my new training from these checkpoints, or is there something else I need to import so the training starts from my model's previous training and validation loss?

Thanks in advance.

I am not sure I understand what you are doing.

What material do you use for fine-tuning, what for testing, and what is this step for? A WER of 0.26 sounds high for the release.

Sorry, I will try to provide more info.

So I take the 0.8.2 checkpoints and start my training from there using the VoxForge dataset, imported with the bin/import_voxforge.py util. I start the training with the argument that reduces the learning rate if the loss plateaus; I do this because I have found that the loss increases if I keep the initial learning rate. With this I ran the training for 10 epochs on the VoxForge dataset with batch sizes of 64. After the 10 epochs I end up with a loss of around 7.45. With this model the test epoch yields a 0.26 WER.

When I try to continue from the checkpoints saved from this model, which I have fine-tuned for 10 epochs, I find that my training loss starts from 156.00. Is this normal? Shouldn't it start from the loss where it left off after fine-tuning?

If you need any other information I will try my best to provide it.

Learning rate, dropout?

Search for fine tuning and transfer learning in this forum for numbers, which you obviously didn't do …

High loss is fine, but 10 epochs might not be enough

Thanks, will do more! I will get back if I have other problems, thank you!

Hey @othiele, @lissyx, I have a question about the test epoch. Rather than doing inference testing with the exported model, would running training from the same checkpoints for 0 epochs count as testing? I tried it and it immediately goes to the testing phase. Is that a valid way to test models?

Thanks in advance

Please search before you post. Let us know what you found out while searching and we’ll happily add to that.