Bad training results

Hello,

I am trying to train a simple model for exploratory purposes, to recognize digits and/or letters spoken in Portuguese.

My dataset is composed of about 800 samples: 300 of digits and 500 of letters, each about 1 s long. The audio sample rate is 48 kHz.
For each training run, I split the audio 70-15-15 into train-dev-test, with the samples well shuffled.
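Something along these lines reproduces that split (just a sketch, not my exact script; all.csv is a placeholder master CSV with a header line and one row per clip):

# shuffle the data rows (everything after the header), then cut them 70/15/15
tail -n +2 all.csv | shuf > shuffled.csv
total=$(wc -l < shuffled.csv)
n_train=$((total * 70 / 100))
n_dev=$((total * 15 / 100))
# keep the header in each split, then append the corresponding slice of rows
head -n 1 all.csv > train.csv; head -n "$n_train" shuffled.csv >> train.csv
head -n 1 all.csv > dev.csv;  tail -n +"$((n_train + 1))" shuffled.csv | head -n "$n_dev" >> dev.csv
head -n 1 all.csv > test.csv; tail -n +"$((n_train + n_dev + 1))" shuffled.csv >> test.csv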

The steps I followed were the ones detailed on this page https://deepspeech.readthedocs.io/en/master/TRAINING.html

I am currently using the master version of DeepSpeech, and since I do not have an NVIDIA GPU, I used the regular (CPU) TensorFlow package.

The best results I got came from separating the digits from the letters and training on each set separately, with the following params (a sketch of the full command follows the list), which I chose after some research on this forum and whatever other info I could find online:

n_hidden: 370
epochs: 3000
dropout_rate: 0.3
learning_rate: 0.001
feature_win_len: 25
feature_win_step: 10
audio_sample_rate: 48000
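Concretely, the kind of invocation I mean would look something like this (just a sketch; the CSV paths, checkpoint and export directories are placeholders rather than my exact setup):

python3 DeepSpeech.py \
  --train_files data/letters/train.csv \
  --dev_files data/letters/dev.csv \
  --test_files data/letters/test.csv \
  --alphabet_config_path data/alphabet.txt \
  --n_hidden 370 \
  --epochs 3000 \
  --dropout_rate 0.3 \
  --learning_rate 0.001 \
  --feature_win_len 25 \
  --feature_win_step 10 \
  --audio_sample_rate 48000 \
  --checkpoint_dir checkpoints/letters \
  --export_dir export/letters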

The models trained with the above params reached a 95% WER for the digits and a 99% WER for the letters after about 15 hours of training, at which point they stopped improving.

All the other models, trained with several combinations of the above params, resulted in a 100% WER and output blank results (" "). The combinations included:
n_hidden: 512, 1024, 2048
dropout_rate: 0, 0.1, 0.2, 0.3
learning_rate: 0.01, 0.001, 0.0001
default feature_win_len and feature_win_step.

I also tried training a model with the Common Voice Portuguese dataset (700 MB), which resulted in the same problem: 100% WER with blank results (" "). I did not dedicate much time to trying different parameter combinations for this set, though.

Honestly, it feels like I'm doing something wrong, even though I think I followed the steps pretty closely.

I’m confused about some things though:

  • In earlier DeepSpeech versions, and in many tutorials I find, the _lm and _trie files were necessary. Is that not the case for the master version? (They are not mentioned in the tutorial.) Could this be my problem? I have tried creating a scorer file and passing it via the --scorer_path flag, but I ran into some problems.

  • Is it the lack of a GPU?

  • Is my dataset too small? In that case I would still expect to be able to at least overfit the model easily, no?

  • In the tutorial steps, the Common Voice audio is first imported with the bin/import_cv2.py script, “for bringing this data into a form that DeepSpeech understands”. Do I have to run any similar preprocessing on my own audio as well?

  • Any problems with my audio having a sample rate of 48 kHz? The only thing I did about this was always running the training with audio_sample_rate set to 48000.

Any insights would be really appreciated,
Thanks in advance!

We can’t be responsible for others’ outdated tutorials. The LM and trie are still there; they are now merged together into a scorer file. We can’t help fix “some problems” if you don’t share details.

No

It could be, but training on CPU, how many epochs could you complete?

This is not a value we have feedback on; we know it works well with 16 kHz and 8 kHz, but 48 kHz might be different.
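If you want to rule 48k out, downsampling a copy of your data to 16 kHz mono with sox would look roughly like this (a sketch only; the directory names are placeholders):

# convert every clip to 16 kHz, mono, 16-bit PCM, writing into a separate directory
mkdir -p wavs_16k
for f in wavs_48k/*.wav; do
  sox "$f" -r 16000 -c 1 -b 16 "wavs_16k/$(basename "$f")"
done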

Importers are there to produce CSV files that DeepSpeech can ingest, and to do some basic filtering; only you have your data, so only you can answer whether this is needed.

See above.

I see no mention of your alphabet, and you don’t share your training command line. Please be exhaustive.

Thank you for your answer!

With a small n_hidden (I tried 256 and 370), the most I completed was around 3000 epochs.
With a higher n_hidden (2048) I can’t remember exactly, but fewer than 10, because those runs were taking too long.

I have my CSVs set up with the format wav_filename,wav_filesize,transcript
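For illustration, a few rows look something like this (the paths and file sizes here are made-up placeholders, not my real data):

wav_filename,wav_filesize,transcript
/home/me/data/letters/a_0001.wav,96044,a
/home/me/data/letters/b_0007.wav,95832,b
/home/me/data/letters/c_0003.wav,96100,c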

For the alphabet I have a file with the 26 letters, one per line. (I’m only training on the letters for now, forget the digits.)
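So the alphabet file is literally just:

a
b
c
d

and so on, one letter per line down to z, with nothing else in the file.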

My command line looks like this:

This is the result the training converges to after ~2 hrs, run WITHOUT the --scorer_path flag

I have never succeeded in running it with my own scorer. I get a segmentation fault at the end of training when I do.

How I generated the scorer:

Create myVocabulary.txt - the same as my alphabet in my case, right? Just the 26 letters, one per line.

Build the LM with KenLM:

./lmplz --text <myVocabulary.txt> --arpa words.arpa --o 5 --memory 10% --discount_fallback

./build_binary -T -s -v words.arpa lm.binary

python data/lm/generate_package.py --alphabet <alphabet.txt> --lm <mylm.binary> --vocab <myVocabulary.txt> --default_alpha 0.5 --default_beta 0.5 --package …/…/deliver/

The last command outputs the following:
26 unique words read from vocabulary file.
Looks like a character based model.
ERROR: AlignOutput: Can’t determine stream position
ERROR: Could not align file during write after header
Package created in …/…/…/…/deliver/

Yeah, but then your network might not be big enough. We don’t have a good overview of values that small.

And in that case, you just have not trained enough.

And the transcripts are valid? Are all audio files the same sample rate, number of channels, etc.?
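You can check that quickly with soxi from the sox package, e.g. (the path is a placeholder):

# print sample rate and channel count for every clip, to spot outliers
for f in /path/to/wavs/*.wav; do
  echo "$f: $(soxi -r "$f") Hz, $(soxi -c "$f") channel(s)"
done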

You need to ensure you are using the exact same alphabet file (including the same ordering) everywhere.

Likely because you are building it wrongly. Please refer to our docs, not outdated tutorials.

Was lm.binary properly created?

You don’t mention what version of DeepSpeech you are using.

Your vocabulary is just one letter per line? That might not work…

master version

The output of ./build_binary -T -s -v words.arpa lm.binary was OK, no errors.

yes and yes

Yes, how should it be? All of the transcripts are single letters, from a-z.

Please use generate_lm.py
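It runs lmplz and build_binary for you. Roughly something like this (a sketch only; the values are illustrative and flag names can change between versions, so check python3 data/lm/generate_lm.py --help on your checkout):

python3 data/lm/generate_lm.py \
  --input_txt myVocabulary.txt \
  --output_dir . \
  --top_k 500 \
  --kenlm_bins path/to/kenlm/build/bin/ \
  --arpa_order 5 \
  --max_arpa_memory "85%" \
  --arpa_prune "0|0|1" \
  --binary_a_bits 255 \
  --binary_q_bits 8 \
  --binary_type trie \
  --discount_fallback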

Well, I’m just saying that in your case it might not be good. Maybe try not using a scorer at all; pass an empty string: --scorer ""

Ran 20 epochs with --scorer "" and the same hyperparams as before. Got a 0.95 WER and CER, with a loss of 3.31. By the 10th epoch it was barely improving :confused:

Just curious, how small can n_hidden be in your experiments? I.e., if we decrease n_hidden a bit below that sweet spot, does the loss/WER explode?