Hello,
I am trying to train a simple model, for exploratory purposes, to recognize digits and/or letters spoken in Portuguese.
My dataset consists of about 800 samples: 300 of digits and 500 of letters, each about 1 s long. The audio sample rate is 48 kHz.
For each training run, I split the audio 70-15-15 into train-dev-test sets, after shuffling the files well.
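For context, the split itself is scripted along these lines (a simplified sketch: `data/clips.csv` and the output paths are placeholders, and the three columns are the ones the DeepSpeech CSV format expects):

```python
# Simplified sketch of the 70-15-15 split. "data/clips.csv" is a placeholder
# listing of all samples; the columns (wav_filename, wav_filesize, transcript)
# are the ones DeepSpeech's training CSVs expect.
import csv
import random

with open("data/clips.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

random.seed(42)      # reproducible shuffle
random.shuffle(rows)

n = len(rows)
splits = {
    "train": rows[: int(0.70 * n)],
    "dev":   rows[int(0.70 * n): int(0.85 * n)],
    "test":  rows[int(0.85 * n):],
}

for name, subset in splits.items():
    with open(f"data/{name}.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["wav_filename", "wav_filesize", "transcript"])
        writer.writeheader()
        writer.writerows(subset)
```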
I followed the steps detailed on this page: https://deepspeech.readthedocs.io/en/master/TRAINING.html
I am currently using the master version of DeepSpeech. I do not have an NVIDIA GPU, so I used the regular (CPU) TensorFlow package.
The best results I got came from separating the digits from the letters and training on each set separately, with the following params, which I chose after some research on this forum and whatever else I could find online (the invocation I used is sketched after the list):
n_hidden: 370
epochs: 3000
dropout_rate: 0.3
learning_rate: 0.001
feature_win_len: 25
feature_win_step: 10
audio_sample_rate: 48000
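For reference, the training invocation looks roughly like this (a simplified sketch: the CSV paths and the checkpoint/export directories are placeholders; the flags are the documented DeepSpeech.py ones):

```python
# Simplified sketch of the training run for the digits set.
# Paths are placeholders; flags mirror the params listed above.
import subprocess

subprocess.run([
    "python3", "DeepSpeech.py",
    "--train_files", "data/digits/train.csv",
    "--dev_files", "data/digits/dev.csv",
    "--test_files", "data/digits/test.csv",
    "--n_hidden", "370",
    "--epochs", "3000",
    "--dropout_rate", "0.3",
    "--learning_rate", "0.001",
    "--feature_win_len", "25",
    "--feature_win_step", "10",
    "--audio_sample_rate", "48000",
    "--checkpoint_dir", "checkpoints/digits",
    "--export_dir", "export/digits",
], check=True)
```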
The models trained with the above params reached 95% WER on the digits and 99% WER on the letters, after training for about 15 hours, at which point training stopped improving.
All the other models, trained with various combinations of the params below, ended up at 100% WER and output blank results (" "):
n_hidden: 512, 1024, 2048
dropout_rate: 0, 0.1, 0.2, 0.3
learning_rate: 0.01, 0.001, 0.0001
default feature_win_len and feature_win_step
I also tried training a model on the Common Voice Portuguese dataset (700 MB), which resulted in the same problem: 100% WER with blank results (" "). I did not spend much time trying different param combinations on this set, though.
Honestly, it feels like I’m doing something wrong, even though I think I followed the steps pretty well.
I’m confused about some things though:
- In earlier DeepSpeech versions, and in many tutorials I find, the _lm and _trie files were necessary. Is that no longer the case for the master version? (They are not mentioned in the tutorial.) Could this be my problem? I have tried creating a scorer file and passing it via the --scorer_path flag, but I ran into some problems.
- Is it the lack of a GPU?
- Is my dataset too small? In that case, I would expect to at least be able to overfit easily, no?
- In the tutorial steps, the Common Voice audios are first imported with bin/import_cv2.py, “for bringing this data into a form that DeepSpeech understands”. Do I have to run any similar preprocessing on my own audios as well?
- Is there any problem with my audios having a sample rate of 48 kHz? The only thing I did about this was always running the training with audio_sample_rate set to 48000 (a resampling sketch I have in mind follows this list).
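In case the 48 kHz rate is the issue, this is the kind of resampling to 16 kHz I have in mind (a hypothetical sketch using librosa and soundfile; the paths are placeholders and I have not actually run this yet):

```python
# Hypothetical sketch: resample every clip from 48 kHz to 16 kHz before training.
# Source/destination directories are placeholders.
from pathlib import Path

import librosa
import soundfile as sf

SRC = Path("data/clips_48k")
DST = Path("data/clips_16k")
DST.mkdir(parents=True, exist_ok=True)

for wav in SRC.glob("*.wav"):
    audio, sr = librosa.load(wav, sr=16000)                # load mono and resample to 16 kHz
    sf.write(DST / wav.name, audio, sr, subtype="PCM_16")  # write 16-bit PCM WAV
```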
Any insights would be really appreciated,
Thanks in advance!