lissyx:
The only solution is to reduce the model complexity; the current models use n_hidden=2048. The temporal complexity of the model depends roughly quadratically on this parameter. So, as a rough guide, if inference on a 2-second audio file takes 5 seconds, you are about 2.5x slower than real time. From there you should be able to infer the model size you require.
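As a back-of-the-envelope sketch of that scaling, the following assumes the roughly quadratic dependence on n_hidden described above; the 2.5x baseline is the illustrative figure from the reply, and the function name is hypothetical:

```python
# Rough estimate of how inference time scales with n_hidden,
# assuming time grows quadratically with the layer width.
# Baseline: n_hidden=2048 running at ~2.5x slower than real time.

def realtime_factor(n_hidden, base_n_hidden=2048, base_rtf=2.5):
    """Predicted real-time factor (processing time / audio duration)."""
    return base_rtf * (n_hidden / base_n_hidden) ** 2

# Under these assumptions, n_hidden=800 would give
# 2.5 * (800/2048)^2, i.e. well under 1.0x (faster than real time).
print(realtime_factor(800))
```

This is only an estimate for picking a candidate width; actual speed depends on hardware, audio length, and the rest of the pipeline.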
But re-training from scratch with a smaller model dimension is going to be a complicated process: you need thousands of hours of audio, and several attempts to tune the parameters.
So should I set --n_hidden to 800? Are the remaining parameters correct?
python3 DeepSpeech.py --n_hidden 800 --checkpoint_dir path/to/checkpoint/folder --epochs 3 --train_files my-train.csv --dev_files my-dev.csv --test_files my_dev.csv --learning_rate 0.0001
Roughly, on what kind of machine could such training run, and how long would it take, say using the Common Voice English data? An estimate in tens of hours is enough.
Thanks.