We are evaluating Deep Speech to transcribe audio files in Brazilian Portuguese. The files are recordings of calls made by users across the country to different call centers. To estimate the effort required for the training stage, we need to determine the sample size. How do I calculate it properly?
One issue you may run into with the current model is a length mismatch: you train on short samples (e.g., 30 seconds) and then try to run inference on much longer audio (e.g., 30 minutes). So I would say the sample size also depends on your needs. It might be easier to train on short samples (30-60 seconds) and then rely on VAD (voice activity detection) to cut the audio into segments at inference time.
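To make the "cut at inference" idea concrete, here is a rough sketch. It is not Deep Speech code and not a real VAD: it uses a plain energy threshold as a stand-in for one (a proper VAD such as WebRTC VAD would replace the `frame_rms` check). The function name `split_on_silence`, the threshold, and the frame and silence lengths are all assumptions for illustration, and the audio is assumed to be mono, already loaded as floats in [-1, 1].

```python
import math
from typing import List, Sequence, Tuple


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def split_on_silence(
    samples: Sequence[float],
    sample_rate: int,
    frame_ms: int = 30,
    rms_threshold: float = 0.01,   # assumed value; tune on your own recordings
    min_silence_frames: int = 10,  # about 300 ms of silence closes a segment
    max_segment_s: float = 60.0,   # keep segments within the training length
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of segments that contain speech.

    Energy thresholding is a crude substitute for a real VAD; it only
    illustrates how long audio can be cut into training-sized pieces.
    """
    frame_len = sample_rate * frame_ms // 1000
    max_len = int(max_segment_s * sample_rate)
    segments: List[Tuple[int, int]] = []
    start = None          # start index of the currently open segment
    last_voiced_end = 0   # end index of the last frame judged as speech
    silence_run = 0       # consecutive silent frames inside an open segment

    for pos in range(0, len(samples) - frame_len + 1, frame_len):
        voiced = frame_rms(samples[pos:pos + frame_len]) >= rms_threshold
        if voiced:
            if start is None:
                start = pos
            silence_run = 0
            last_voiced_end = pos + frame_len
        elif start is not None:
            silence_run += 1

        if start is not None:
            # Close the segment if one more frame would exceed the limit.
            too_long = (pos + frame_len - start) + frame_len > max_len
            if silence_run >= min_silence_frames or too_long:
                segments.append((start, last_voiced_end))
                start = None
                silence_run = 0

    if start is not None:
        segments.append((start, last_voiced_end))
    return segments
```

Each returned `(start, end)` pair can be used to slice the original audio, e.g. `samples[start:end]`, and each slice is then transcribed separately. With `max_segment_s=60.0` no slice is longer than the 30-60 second range used in training, even if the caller never pauses.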