Hello! I am training a model to detect 9 commands in Finnish with a dataset of 2200 words (train 70%, dev 20%, test 10%). Each sample is an audio clip of about 1 second containing the given word.
Training config:
Snippet from train.csv run at the end of training:
So it is detecting the test files with good accuracy. But when I run inference on a short audio clip with a command in it, the output is empty.
Same with mic_vad_streaming: it detects voice, but the output is empty.
If your model doesn’t work for new material you might have overfitting or not enough training. How many epochs did you train? You don’t have much material.
Your scorer should contain all the word combinations you want to detect. Leave out all the parameters you can, as they are meant for GBs of data. You are pruning too much.
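To make the point about word combinations concrete, here is a rough sketch of how you could generate the text that the scorer is built from. The command words and the `max_len` limit are placeholders I made up, not taken from your setup, so swap in your own 9 commands.

```python
from itertools import product


def command_sentences(commands, max_len=2):
    """Yield every sequence of 1..max_len commands as one line of text."""
    for n in range(1, max_len + 1):
        for combo in product(commands, repeat=n):
            yield " ".join(combo)


if __name__ == "__main__":
    # Hypothetical command list, for illustration only.
    commands = ["kyllä", "ei", "seis"]
    for sentence in command_sentences(commands, max_len=2):
        print(sentence)
```

Write the printed lines to a text file and use that as the input for building the language model. With only 9 commands the file stays tiny, which is also why heavy pruning is not needed.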
Your alphabet should contain only the letters to recognize, one per line. I don’t know Finnish, but it looks like you have whole words in there too. That might be problematic.
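A quick way to check this is something like the sketch below. It assumes the usual alphabet file layout of one character per line with `#` starting a comment line; the function names and the sample words are mine, just for illustration.

```python
def build_alphabet(transcripts):
    """Collect the unique characters used in the transcripts, sorted."""
    chars = set()
    for text in transcripts:
        chars.update(text)
    return sorted(chars)


def multi_char_entries(alphabet_lines):
    """Return alphabet entries that are longer than one character."""
    bad = []
    for line in alphabet_lines:
        entry = line.rstrip("\r\n")  # keep a lone space, it is a valid entry
        if entry.startswith("#") or entry == "\\#":
            continue  # comment line, or the escaped '#' character
        if len(entry) > 1:
            bad.append(entry)
    return bad


if __name__ == "__main__":
    # Hypothetical transcripts and alphabet content, for illustration only.
    print(build_alphabet(["kyllä", "ei"]))
    print(multi_char_entries(["# comment\n", "a\n", " \n", "kyllä\n", "ä\n"]))
```

If `multi_char_entries` returns anything for your alphabet file, those lines are whole words and should be replaced by the individual letters that `build_alphabet` finds in your transcripts.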
Empty output after many epochs on a smaller set is a bit strange. I would guess you get results with 15-20 epochs, as you have little data. Check again with the new settings, chances are good.
Maybe use a smaller batch size, but that should only worsen the training by a bit.
Train loss is 0.4 but dev loss is 20. Could that point to the issue? Should there still be some output? Then again, I think that in an earlier, longer training run the dev loss got close to 2 and there was no output either.
Loss in itself is not that important for such a small amount of material. It is rather how the train and dev losses relate to each other. Typically they get closer for a while, then train keeps getting better and dev doesn’t. Then you are overfitting.
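If you log the losses per epoch, you can spot that point with a few lines of Python. This is only a sketch: the `patience` value and the loss numbers are invented for illustration and are not from your run.

```python
def best_epoch(dev_losses):
    """Index of the epoch with the lowest dev loss."""
    return min(range(len(dev_losses)), key=dev_losses.__getitem__)


def early_stop_epoch(dev_losses, patience=3):
    """First epoch where dev loss has not improved for `patience` epochs."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(dev_losses):
        if loss < best:
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None  # dev loss was still improving


if __name__ == "__main__":
    # Made-up per-epoch losses: train keeps falling, dev turns around.
    train = [28, 20, 12, 6, 3, 1, 0.4]
    dev = [30, 25, 22, 21, 23, 24, 26]
    print("best dev epoch:", best_epoch(dev))
    print("stop at epoch:", early_stop_epoch(dev, patience=3))
    print("final gap dev - train:", dev[-1] - train[-1])
```

The checkpoint worth keeping is the one from the best dev epoch, not the last one. A train loss of 0.4 next to a dev loss of 20 is the kind of gap this would flag.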
There is some material from Common Voice/Mozilla for just numbers. Maybe take that for English to get a feel for how much data you need and how long you have to train. Maybe you just don’t have enough material.