Blank res when building/training my own model and vocabulary

Hello,

I’m trying to build a model for demo purposes with a very short vocabulary (8 different French words: a, b, c, d, e, suivant, retour, sauvegarde). To do that, I create my own .arpa and .binary files (with KenLM) and then my trie, using the same alphabet as the English pre-trained model (roughly the commands sketched below).
My recordings are mono, 16-bit, 16 kHz.
The first time I trained, just to check that everything was in order, it worked, but since I didn’t have enough data the results were poor.
I then recorded and added more data, and now when I train again with a new arpa, binary and trie, I get blank inference at test time, resulting in WER = 100%.
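
For reference, the LM/trie steps I follow look roughly like this (the n-gram order and paths are only examples; on a corpus this small, lmplz usually needs --discount_fallback, and the exact generate_trie arguments depend on the DeepSpeech version):

# Build the ARPA file from the word list (tiny corpora typically need --discount_fallback)
lmplz --order 2 --discount_fallback < vocabulary.txt > lm.arpa

# Convert the ARPA file to KenLM's binary format
build_binary lm.arpa lm.binary

# Build the trie with the same alphabet used for training
generate_trie alphabet.txt lm.binary trie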

I looked into some topics that said it could be a recording-format error (not my case) or a character missing from my alphabet. I used check_characters.py to check whether any character was missing, but none was (a quick manual check is sketched below).
I tend to think it’s an alphabet problem, because this behaviour started after I added some French Common Voice data and modified the alphabet (and regenerated the arpa, binary and trie).
So once I got this error, I went back to my old alphabet and data (plus binary and trie), but the error is still there, which makes me wonder what’s causing it…
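
As an extra sanity check, here is a quick way to list every character that appears in the transcripts and compare it by eye with alphabet.txt (a rough sketch, assuming the transcript is the third comma-separated column of the CSV and contains no commas itself):

# Distinct characters in the transcript column of train.csv
cut -d',' -f3 data/train.csv | tail -n +2 | grep -o . | sort -u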

Here is my command:

python -u DeepSpeech.py --show_progressbar \
  --train_files data/train.csv \
  --test_files data/test.csv \
  --train_batch_size 1 \
  --test_batch_size 1 \
  --n_hidden 1024 \
  --epochs 1 \
  --checkpoint_dir .. \
  --export_dir .. \
  --summary_dir .. \
  --lm_binary_path ../lm.binary \
  --lm_trie_path ../trie \
  --alphabet_config_path data/alphabet.txt

and here is the kind of output I get:

WER: 1.000000, CER: 1.000000, loss: 4.078539
 - src: "a"
 - res: ""

My question is: what could be causing this behaviour?

I have redone all the steps multiple times to make sure I didn’t mess something up along the way, but it’s still the same issue… And it’s not an environment problem, because a similar project with other data (all the months of the year, in French) works well.

I’ve been working on this for two days, my brain is stuck and I can’t find the cause, so any help is greatly welcome!
Thank you very much
If you need more information, ask me :slight_smile:

EDIT:
My vocabulary.txt file:

a
b
c
d
e
suivant
retour
sauvegarde

My alphabet.txt file:

# Each line in this file represents the Unicode codepoint (UTF-8 encoded)
# associated with a numeric label.
# A line that starts with # is a comment. You can escape it with \# if you wish
# to use '#' as a label.
 
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
'
# The last (non-comment) line needs to end with a newline.

I checked my recordings’ format with Audacity (that’s the tool I use to create my recordings); maybe there is a better way to check the format?
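
For a second opinion besides Audacity, a quick command-line check (a sketch, assuming SoX is installed and the wav files live under data/; every file should report Channels: 1, Sample Rate: 16000, Precision: 16-bit):

for f in data/*.wav; do
  echo "$f"
  soxi "$f" | grep -E 'Channels|Sample Rate|Precision'
done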

Note that even when I remove the lm and trie parameters, it still returns a blank res.

That’s going to be your problem: you have a small dataset and vocabulary. Are you training on your own data?

With only one epoch, the model does not learn anything.

Hi @lissyx,
Thanks for your response :slight_smile:

Yes, it’s my own data. I also added CV and YouTube data to my dataset.

Do I need more epochs, then? I’ll try it and get back to you.

How many hours, approximately?

Is it French YouTube? If the license is okay, I’m interested in including it in the set of sources for a French model.

Yes, many more. Given that you changed the network width, I’m not sure how many; it’s hard to give a ballpark, and other hyper-parameters might need different tuning as well.

Well, first I just want it to return something; even a wrong answer is fine, the blank res is killing me… I’ll tune the hyper-parameters afterwards.

Well, for now, as it is more a PoC than a real use case, we have less than 1 hour of data. We just want it to work for our case; in parallel we are adding data, but we can’t wait for our dataset to be big enough before starting training, as we’re very limited in time…
Eventually we aim to have ~5-10 hours of this specific data (will that be enough? I don’t really know what size we need, as our vocabulary is very small).
For YouTube, yes, it is French, but for my usage I only take audio of letters (e.g. someone spelling out the alphabet in French, with different speakers and accents).
I don’t know about the license; I think I can give you my data, but I have to check with my superior…

You were right, it is a matter of the number of epochs: it works (and overfits) with 100 epochs on a small part of my dataset, but on my full dataset it still returns blank after 30+30+100 epochs. I’ll try without the CV data, since those are sentences rather than single words like the rest, and the training might be simpler without them.

If you only have 1h of data for the full training, then it’s possible your network is still too big right now.

Should I divide it by 2, 4 or 8?

Honestly, I don’t have a good answer. I know I get a barely usable model with ~200-250 h of French at a width of 2048. So I’d say try lower, maybe 256, and adjust according to the behaviour?
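
Something along these lines, keeping the rest of your command unchanged (the epoch count below is only a placeholder, adjust it to your data):

python -u DeepSpeech.py --show_progressbar \
  --train_files data/train.csv \
  --test_files data/test.csv \
  --train_batch_size 1 \
  --test_batch_size 1 \
  --n_hidden 256 \
  --epochs 100 \
  --checkpoint_dir .. \
  --export_dir .. \
  --summary_dir .. \
  --lm_binary_path ../lm.binary \
  --lm_trie_path ../trie \
  --alphabet_config_path data/alphabet.txt

Note that with a different --n_hidden you will have to start from a fresh --checkpoint_dir, since checkpoints from the 1024-wide run won’t load into a 256-wide network.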

OK, I’ll start from 256 and adjust after some trial and error :stuck_out_tongue:
Thanks a lot, I’ll keep you posted on how it goes!

Hi @lissyx,

Just a quick question: is there an order in which to optimize the hyper-parameters? I saw that you set learning_rate, then dropout_rate, and then epochs, but what about beta1, beta2, epsilon, lm_alpha and lm_beta? Which one should I focus on first, then second, then…
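
For context, these are the flags I mean, with placeholder values just to show where they go (not tuned values):

python -u DeepSpeech.py \
  --train_files data/train.csv \
  --test_files data/test.csv \
  --n_hidden 256 \
  --epochs 100 \
  --learning_rate 0.0001 \
  --dropout_rate 0.15 \
  --beta1 0.9 \
  --beta2 0.999 \
  --epsilon 1e-8 \
  --lm_alpha 0.75 \
  --lm_beta 1.85 \
  --alphabet_config_path data/alphabet.txt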

I understand that the modus operandi is more trial and error than automatic…

Thanks again for your help,