To begin with, thank you for creating this amazing effort and initiative. I’ve spent a few weeks reading about and experimenting with DeepSpeech, and I’m thrilled!
I’m building an offline voice assistant and I think I’ve got all the parts worked out. However, I have a few questions whose answers would really help me move forward. If everything works out, I’m going to publish an updated tutorial for 0.7.4.
I’m creating a voice assistant for a specific domain with ~120 words, typically distributed over about 50 phrases with some variation.
It’s English for now, but I’ll move on to Swedish, my mother tongue, once I’ve got something working.
I’m aiming for an offline Raspberry Pi 4 experience, which means I’m using the TensorFlow Lite model.
I’m collecting my own training data, crowdsourced in a similar way to how Common Voice does it.
Q1. Does it matter whether I collect recordings of individual words or whole phrases for training?
Q2. Does it matter if I trim the silence at the beginning and end of my recordings?
Q3. Does it matter if I normalize the sound files and “polish them up” a bit?
Q4. How much training data do I need? Would e.g. 100 people saying all phrases 3 times be enough for a non-fine-tuning approach (training from scratch)?
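For context on Q2/Q3, this is roughly the preprocessing I’m doing right now, shown as a minimal plain-Python sketch over raw 16-bit PCM samples (the silence threshold and normalization target are just my guesses, not values I’m claiming are right):

```python
# Sketch of the silence trimming (Q2) and peak normalization (Q3) I'm
# asking about. Operates on 16-bit PCM samples as a list of ints; the
# threshold and target level are my own guesses, not recommendations.

def trim_silence(samples, threshold=500):
    """Drop leading/trailing samples whose absolute amplitude is below threshold."""
    start, end = 0, len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]

def peak_normalize(samples, target=0.9 * 32767):
    """Scale samples so the loudest one reaches `target` (simple peak normalization)."""
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return list(samples)
    scale = target / peak
    return [int(round(s * scale)) for s in samples]
```

In practice I’d do this with sox or similar, but this is the effect I mean by “polishing”.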
Q5. If using the fine-tuning approach, how many epochs make sense to run? Should I use a negative epoch count, as suggested in some posts here?
Q6. If using the fine tuning approach, can I update the scorer or use my own?
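To make Q6 concrete, this is the command sequence I’m planning to try for building my own scorer, based on my reading of the 0.7 docs. All paths, the `--top_k` value, and the alpha/beta defaults are placeholders from my notes, so please correct me if I’ve misread the tooling:

```shell
# Sketch of building a custom domain scorer for DeepSpeech 0.7.x.
# Paths and the default_alpha/default_beta values are placeholders.

# 1. Build a pruned KenLM language model from my domain phrases:
python3 data/lm/generate_lm.py \
  --input_txt my_phrases.txt \
  --output_dir my_lm/ \
  --top_k 500 \
  --kenlm_bins /path/to/kenlm/build/bin \
  --arpa_order 3 \
  --binary_type trie

# 2. Package it into a .scorer with the native client tool:
./generate_scorer_package \
  --alphabet alphabet.txt \
  --lm my_lm/lm.binary \
  --vocab my_lm/vocab-500.txt \
  --package my_domain.scorer \
  --default_alpha 0.93 --default_beta 1.18
```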
Q7. What are some reasonable hyperparameters? My best bet right now for a training set of 4,500 phrases is 400 hidden units (`--n_hidden 400`), 0.3 dropout, and a 0.0001 learning rate.
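For Q7, here’s the training invocation I’m currently planning, using the 0.7.x flags as I understand them (all file paths are placeholders, and the epoch count is a guess):

```shell
# Sketch of my planned training run with the hyperparameters from Q7.
# DeepSpeech 0.7.x flags as I understand them; all paths are placeholders.
python3 DeepSpeech.py \
  --train_files data/train.csv \
  --dev_files data/dev.csv \
  --test_files data/test.csv \
  --n_hidden 400 \
  --dropout_rate 0.3 \
  --learning_rate 0.0001 \
  --epochs 30 \
  --checkpoint_dir checkpoints/ \
  --export_dir model/ \
  --export_tflite \
  --scorer_path my_domain.scorer
```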
Q8. How do you guys avoid overfitting? I’m looking at TensorBoard, but it doesn’t show a test curve; I just see that the train and dev losses are converging, and that’s it.
Q9. Does early stopping prevent overfitting?
Q10. How do I show the WER (the same output that’s printed after training)?
So many questions! I’d be enormously grateful to get even a few of them answered by anyone who knows.
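For Q10, in the meantime I’ve been computing WER myself with a small word-level Levenshtein sketch. This is my own approximation, not DeepSpeech’s internal reporting, so I’d still like to know how to get the real post-training output on demand:

```python
# Word error rate via word-level Levenshtein (edit) distance:
# WER = (substitutions + insertions + deletions) / reference word count.
def wer(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("turn on the light", "turn off the light"))  # → 0.25
```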