Job assignment. In my free time I have generally worked with CNNs and image data, so RNNs are a new venture for me.
How much data do you have?
I’m curious about your input data: is it 8 kHz or 16 kHz? Mono or stereo? Those details are very important if you look into fine-tuning or transfer learning.
I’m wondering how much you can just adapt the existing framework to your problem. Given the audio quality, have you had a look at:
- some low/high-pass filtering,
- some de-noising (RNNoise is said to be quite good)
I suspect you’ve run experiments already, so if you could elaborate more on those, that might help us understand how far you are from your goal and see if there’s something we can help with in the meantime.
As for inference results, I can’t provide much yet since the data is still being collected. In ad-hoc cases, though, the WER definitely exceeds 90% at inference with the DeepSpeech 0.7.x model and scorer.
I have scripting that converts my files to 16 kHz before I ever use them for DeepSpeech inference or training. As for stereo vs. mono, I’ll have to check. DeepSpeech uses mono, correct? I can convert to mono if necessary.
I also have scripting that can apply low-, high-, and band-pass filters to my data. I have little signal-processing background, but I have found some success with various parameter settings here.
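Roughly, the conversion and filtering step looks like this (a sox sketch; the filenames and cutoff frequencies are illustrative values I’ve been experimenting with, not tuned ones):
# Resample to 16 kHz, downmix to mono, force 16-bit PCM, then apply an
# illustrative band-pass (300 Hz to 3 kHz, roughly the voice band of
# narrow-band radio audio) via chained high-pass and low-pass effects.
sox input.wav -r 16000 -c 1 -b 16 output.wav highpass 300 lowpass 3000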
You can get an accurate picture with a few hours already.
It’d be interesting to check the CER as well.
You convert to 16 kHz from what?
Our model is trained on 16 kHz mono audio; if you feed it anything else, it will produce erratic results. Our example binaries might do automatic down-/up-sampling and stereo-to-mono conversion, but that can introduce glitches, so in specific cases like yours it’s always beneficial to control this step completely yourself.
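A quick way to verify what you’re actually feeding the model (assuming you have sox installed; soxi ships with it, and the filename is a placeholder):
# Prints channels, sample rate, bit depth, and duration for an audio file.
soxi recording.wav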
Some of the air traffic audio I have obtained comes from YouTube: I pull the audio as .mp4, convert it to .mp3 at 44.1 kHz, then downsample to 16 kHz and export as WAV. I’ll need to convert to mono in this step as well. As for the optimal order of these processing steps, does it matter?
Hard to tell; I don’t think it should have an impact, but we’d be curious to hear your feedback on this processing.
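If it helps, a single ffmpeg call can skip the lossy .mp3 intermediate entirely (a sketch, assuming ffmpeg is available; filenames are placeholders):
# Drop the video stream and write 16 kHz mono 16-bit PCM WAV directly.
ffmpeg -i input.mp4 -vn -acodec pcm_s16le -ac 1 -ar 16000 output.wav
Every lossy encode throws away signal, so converting straight from the source to 16 kHz mono WAV is the safer pipeline.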
Got it. I’ll go through my files, convert them, and see. I should be able to provide some WER and CER benchmarks pre- and post-conversion in the next few days.
This may be a bit of an aside, but in reference to the title of my topic: I benchmarked the 0.7.4 release model on the Google Commands test set I am using, which is about 6,400 one-second, one-word audio clips.
With scorer: WER 48%, CER 32%
Without scorer (model only at inference time): WER 42%, CER 23%
Thought I would share. I know that making my own scorer for this data is challenging since my corpus is strictly unigrams and KenLM doesn’t support unigram-order language models. I’m wondering whether the Mozilla scorer being built primarily on full sentences is what causes the .scorer to hurt me out of the box.
Regardless, I thought it was important to share. I am going to attempt fine-tuning the 0.7.4 release on Google Commands with a smaller learning rate. I’m not sure how many epochs will be needed, but I’ll iterate through a few combinations, although progress will probably be slow while my hardware is lacking.
There’s a trick you can use as a workaround: just add a single sentence with more words, maybe some not used in your benchmark.
Interesting. So add a single sentence with random words and use arpa order = # of words in that sentence? How does this get around it?
Yes. I think that because KenLM can now build an n-gram of the requested order, it doesn’t raise the error, and the rest of the n-grams are 1-grams. I used this to run a benchmark: the 1-gram results were worse than the 3-gram results but better than a non-specialized language model, so I’m assuming this approach is not bad ^^
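Concretely, the workaround looks something like this (a sketch; it assumes KenLM’s lmplz and build_binary are on your PATH, and vocab.txt is a hypothetical corpus file with one command word per line):
# Append one multi-word padding sentence so lmplz can build 3-grams at all;
# everything else in vocab.txt stays a single word per line.
echo "padding sentence with several words" >> vocab.txt
# --discount_fallback is required on tiny corpora, where lmplz cannot
# estimate Kneser-Ney discounts from the n-gram count statistics.
lmplz --order 3 --discount_fallback --text vocab.txt --arpa lm.arpa
build_binary lm.arpa lm.binary
The resulting lm.binary can then be packaged into a .scorer with the language-model tooling under data/lm in the DeepSpeech repo.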
Thanks for the hack. I will try it myself.
@lissyx I’m working on fine-tuning and running into an error. I suspect OOM, but I’m not sure.
CUDA and cuDNN info:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
root@d7284da3dc5c:/DeepSpeech# cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 5
Fine-tuning shell script:
#!/bin/sh
set -xe
if [ ! -f DeepSpeech.py ]; then
    echo "Please make sure you run this from DeepSpeech's top level directory."
    exit 1
fi

# Pin training to the first GPU.
export NVIDIA_VISIBLE_DEVICES=0
export CUDA_VISIBLE_DEVICES=0
#export TF_FORCE_GPU_ALLOW_GROWTH=true
python3 -u DeepSpeech.py \
    --train_files //tmp/external/google_cmds_csvs/train.csv \
    --test_files //tmp/external/google_cmds_csvs/test.csv \
    --dev_files //tmp/external/google_cmds_csvs/dev.csv \
    --epochs 5 \
    --train_batch_size 1 \
    --dev_batch_size 1 \
    --test_batch_size 1 \
    --export_dir //tmp/external/deepspeech_models/deepspeech_fine_tuned_models/googlecommands/ \
    --use_allow_growth \
    --n_hidden 2048 \
    --train_cudnn \
    --checkpoint_dir //tmp/external/mozilla_release_chkpts/deepspeech-0.7.4-checkpoint/ \
    "$@"
I imagine this error means my GPU is running out of memory? It probably can’t handle n_hidden 2048, even with batch sizes of 1…
I Loading best validating checkpoint from //tmp/external/mozilla_release_chkpts/deepspeech-0.7.4-checkpoint/best_dev-732522
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam_1
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/bias/Adam
I Loading variable from checkpoint: layer_6/bias/Adam_1
I Loading variable from checkpoint: layer_6/weights
I Loading variable from checkpoint: layer_6/weights/Adam
I Loading variable from checkpoint: layer_6/weights/Adam_1
I Initializing variable: learning_rate
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Epoch 0 | Training | Elapsed Time: 0:00:02 | Steps: 1 | Loss: 17.931751
Epoch 0 | Training | Elapsed Time: 0:00:02 | Steps: 2 | Loss: 25.073557
Epoch 0 | Training | Elapsed Time: 0:00:03 | Steps: 3 | Loss: 22.607900
Epoch 0 | Training | Elapsed Time: 0:00:03 | Steps: 4 | Loss: 19.332927
Epoch 0 | Training | Elapsed Time: 0:00:03 | Steps: 5 | Loss: 18.299621
Epoch 0 | Training | Elapsed Time: 0:00:03 | Steps: 6 | Loss: 22.737631
Epoch 0 | Training | Elapsed Time: 0:00:03 | Steps: 7 | Loss: 21.567622
Epoch 0 | Training | Elapsed Time: 0:00:03 | Steps: 8 | Loss: 23.387867
2020-07-30 15:45:48.370974: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:273] Unexpected Event status: 1
2020-07-30 15:45:48.371022: F ./tensorflow/core/kernels/reduction_gpu_kernels.cu.h:830] Non-OK-status: GpuLaunchKernel(ColumnReduceKernel<IN_T, OUT_T, Op>, grid_dim, block_dim, 0, cu_stream, in, out, extent_x, extent_y, op, init) status: Internal: unknown error
Fatal Python error: Aborted
Thread 0x00007f5199679700 (most recent call first):
File "/usr/lib/python3.6/multiprocessing/connection.py", line 379 in _recv
File "/usr/lib/python3.6/multiprocessing/connection.py", line 407 in _recv_bytes
File "/usr/lib/python3.6/multiprocessing/connection.py", line 250 in recv
File "/usr/lib/pyAborted
Sorry, this is the first time I’ve seen that error…
I tried adding my own --learning_rate of 0.0001. I think I have seen this error before in other Discourse topics; not sure what it means.
Error:
+ [ ! -f DeepSpeech.py ]
+ export NVIDIA_VISIBLE_DEVICES=0
+ export CUDA_VISIBLE_DEVICES=0
+ python3 -u DeepSpeech.py --train_files //tmp/external/google_cmds_csvs/train.csv --test_files //tmp/external/google_cmds_csvs/test.csv --dev_files //tmp/external/google_cmds_csvs/dev.csv --epochs 5 --train_batch_size 1 --dev_batch_size 1 --test_batch_size 1 --export_dir //tmp/external/deepspeech_models/deepspeech_fine_tuned_models/googlecommands/ --use_allow_growth --n_hidden 2048 --train_cudnn --learning_rate 0.0001 --checkpoint_dir //tmp/external/mozilla_release_chkpts/deepspeech-0.7.4-checkpoint/
I Loading best validating checkpoint from //tmp/external/mozilla_release_chkpts/deepspeech-0.7.4-checkpoint/best_dev-732522
I Loading variable from checkpoint: beta1_power
I Loading variable from checkpoint: beta2_power
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam
I Loading variable from checkpoint: cudnn_lstm/opaque_kernel/Adam_1
I Loading variable from checkpoint: global_step
I Loading variable from checkpoint: layer_1/bias
I Loading variable from checkpoint: layer_1/bias/Adam
I Loading variable from checkpoint: layer_1/bias/Adam_1
I Loading variable from checkpoint: layer_1/weights
I Loading variable from checkpoint: layer_1/weights/Adam
I Loading variable from checkpoint: layer_1/weights/Adam_1
I Loading variable from checkpoint: layer_2/bias
I Loading variable from checkpoint: layer_2/bias/Adam
I Loading variable from checkpoint: layer_2/bias/Adam_1
I Loading variable from checkpoint: layer_2/weights
I Loading variable from checkpoint: layer_2/weights/Adam
I Loading variable from checkpoint: layer_2/weights/Adam_1
I Loading variable from checkpoint: layer_3/bias
I Loading variable from checkpoint: layer_3/bias/Adam
I Loading variable from checkpoint: layer_3/bias/Adam_1
I Loading variable from checkpoint: layer_3/weights
I Loading variable from checkpoint: layer_3/weights/Adam
I Loading variable from checkpoint: layer_3/weights/Adam_1
I Loading variable from checkpoint: layer_5/bias
I Loading variable from checkpoint: layer_5/bias/Adam
I Loading variable from checkpoint: layer_5/bias/Adam_1
I Loading variable from checkpoint: layer_5/weights
I Loading variable from checkpoint: layer_5/weights/Adam
I Loading variable from checkpoint: layer_5/weights/Adam_1
I Loading variable from checkpoint: layer_6/bias
I Loading variable from checkpoint: layer_6/bias/Adam
I Loading variable from checkpoint: layer_6/bias/Adam_1
I Loading variable from checkpoint: layer_6/weights
I Loading variable from checkpoint: layer_6/weights/Adam
I Loading variable from checkpoint: layer_6/weights/Adam_1
I Initializing variable: learning_rate
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:00:00 | Steps: 0 | Loss: 0.000000
Epoch 0 | Training | Elapsed Time: 0:00:02 | Steps: 1 | Loss: 17.931751
Epoch 0 | Training | Elapsed Time: 0:00:03 | Steps: 2 | Loss: 23.942448
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 20, 1, 2048]
[[{{node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3}}]]
(1) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 20, 1, 2048]
[[{{node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3}}]]
[[tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3/_81]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "DeepSpeech.py", line 12, in <module>
ds_train.run_script()
File "/DeepSpeech/training/deepspeech_training/train.py", line 955, in run_script
absl.app.run(main)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/DeepSpeech/training/deepspeech_training/train.py", line 927, in main
train()
File "/DeepSpeech/training/deepspeech_training/train.py", line 595, in train
train_loss, _ = run_set('train', epoch, train_init_op)
File "/DeepSpeech/training/deepspeech_training/train.py", line 560, in run_set
feed_dict=feed_dict)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 20, 1, 2048]
[[node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
(1) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 20, 1, 2048]
[[node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3 (defined at /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3/_81]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3':
File "DeepSpeech.py", line 12, in <module>
ds_train.run_script()
File "/DeepSpeech/training/deepspeech_training/train.py", line 955, in run_script
absl.app.run(main)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/DeepSpeech/training/deepspeech_training/train.py", line 927, in main
train()
File "/DeepSpeech/training/deepspeech_training/train.py", line 473, in train
gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
File "/DeepSpeech/training/deepspeech_training/train.py", line 321, in get_tower_results
gradients = optimizer.compute_gradients(avg_loss)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/training/optimizer.py", line 512, in compute_gradients
colocate_gradients_with_ops=colocate_gradients_with_ops)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gradients_impl.py", line 158, in gradients
unconnected_gradients)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gradients_util.py", line 679, in _GradientsHelper
lambda: grad_fn(op, *out_grads))
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gradients_util.py", line 350, in _MaybeCompile
return grad_fn() # Exit early
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gradients_util.py", line 679, in <lambda>
lambda: grad_fn(op, *out_grads))
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/cudnn_rnn_grad.py", line 104, in _cudnn_rnn_backwardv3
direction=op.get_attr("direction")) + (None,)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_cudnn_rnn_ops.py", line 749, in cudnn_rnn_backprop_v3
name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
...which was originally created as op 'tower_0/cudnn_lstm/CudnnRNNV3', defined at:
File "DeepSpeech.py", line 12, in <module>
ds_train.run_script()
[elided 4 identical lines from previous traceback]
File "/DeepSpeech/training/deepspeech_training/train.py", line 473, in train
gradients, loss, non_finite_files = get_tower_results(iterator, optimizer, dropout_rates)
File "/DeepSpeech/training/deepspeech_training/train.py", line 312, in get_tower_results
avg_loss, non_finite_files = calculate_mean_edit_distance_and_loss(iterator, dropout_rates, reuse=i > 0)
File "/DeepSpeech/training/deepspeech_training/train.py", line 239, in calculate_mean_edit_distance_and_loss
logits, _ = create_model(batch_x, batch_seq_len, dropout, reuse=reuse, rnn_impl=rnn_impl)
File "/DeepSpeech/training/deepspeech_training/train.py", line 190, in create_model
output, output_state = rnn_impl(layer_3, seq_length, previous_state, reuse)
File "/DeepSpeech/training/deepspeech_training/train.py", line 128, in rnn_impl_cudnn_rnn
sequence_lengths=seq_length)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/layers/base.py", line 548, in __call__
outputs = super(Layer, self).__call__(inputs, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/keras/engine/base_layer.py", line 854, in __call__
outputs = call_fn(cast_inputs, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 234, in wrapper
return converted_call(f, options, args, kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 439, in converted_call
return _call_unconverted(f, args, kwargs, options)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/impl/api.py", line 330, in _call_unconverted
return f(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 440, in call
training)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/cudnn_rnn/python/layers/cudnn_rnn.py", line 518, in _forward
seed=self._seed)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/contrib/cudnn_rnn/python/ops/cudnn_rnn_ops.py", line 1132, in _cudnn_rnn
outputs, output_h, output_c, _, _ = gen_cudnn_rnn_ops.cudnn_rnnv3(**args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_cudnn_rnn_ops.py", line 2051, in cudnn_rnnv3
time_major=time_major, name=name)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 20, 1, 2048]
[[{{node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3}}]]
(1) Internal: Failed to call ThenRnnBackward with model config: [rnn_mode, rnn_input_mode, rnn_direction_mode]: 2, 0, 0 , [num_layers, input_size, num_units, dir_count, max_seq_length, batch_size, cell_num_units]: [1, 2048, 2048, 1, 20, 1, 2048]
[[{{node tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3}}]]
[[tower_0/gradients/tower_0/cudnn_lstm/CudnnRNNV3_grad/CudnnRNNBackpropV3/_81]]
0 successful operations.
0 derived errors ignored.
So, you might be hitting https://github.com/tensorflow/tensorflow/issues/41630
Please try with the TF_CUDNN_RESET_RND_GEN_STATE=1 env var.
It can also be just a symptom of GPU OOM.
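Concretely, in your fine-tuning script, before the python3 invocation (a sketch):
# Work around the cuDNN RNN state issue referenced in the TF bug above.
export TF_CUDNN_RESET_RND_GEN_STATE=1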
Yeah, I retried with that env var and still get the same error; it stops at about the 7th step with batch size 1. Is a large amount of memory required for running cuDNN with n_hidden 2048?
If so, I may have to work on getting more GPU power before I can successfully fine-tune.
Is a large amount of memory required for running cuDNN with n_hidden 2048?
Try reducing the network size; you’ll know quite fast.
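For example (a sketch; the smaller --n_hidden value and the scratch checkpoint directory are placeholders, and note that a different n_hidden cannot load the 0.7.4 checkpoint, so this probes memory rather than fine-tuning quality):
# Hypothetical memory probe: a much smaller network trained from scratch.
python3 -u DeepSpeech.py \
    --train_files //tmp/external/google_cmds_csvs/train.csv \
    --dev_files //tmp/external/google_cmds_csvs/dev.csv \
    --test_files //tmp/external/google_cmds_csvs/test.csv \
    --train_batch_size 1 --dev_batch_size 1 --test_batch_size 1 \
    --epochs 1 \
    --n_hidden 512 \
    --train_cudnn \
    --checkpoint_dir //tmp/external/scratch_checkpoint/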
Yep. I tried to train with --n_hidden 100 and --train_cudnn, and it was able to train for a while. Maybe my GPU just can’t handle cuDNN; my GPU usage skyrockets whenever I use that flag.