The entire training set of the “en_1087h_2019-06-12” release was used, 88 hours 30 minutes total.
No. The training parameters are specified in the release notes. Any parameters not mentioned there were set to default values. Online augmentation is disabled by default.
In our current cluster setup, augmented runs take over 20x longer to train than a non-augmented run. That, plus the need to tune the hyperparameters for our case and the desire to get 0.6.0 out soon, meant we chose not to experiment with it for now.
@reuben Is there any particular reason the amount of training data used (3816h) is less than what the corpora should provide? The entire Fisher, LibriSpeech, CommonVoice English and Switchboard corpora that the model is trained on should sum to 5069h.
Because the majority of the data in the current CV release consists of duplicate sentences. This is because the English text corpus was very small until earlier this year, when Sentence Collector launched (and the 1M Wikipedia sentences were added later on). So future dataset releases should have a better ratio of unique to duplicated sentences.
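If you want to sanity-check that for a given release, here is a minimal sketch that counts repeated transcripts in a Common Voice TSV. The `validated.tsv` filename and the `sentence` column are assumptions based on the usual export format, so adjust them to match your download.

```python
# Minimal sketch: count how many clips in a Common Voice TSV share a transcript.
# Assumes a file named "validated.tsv" with a "sentence" column; adjust the path
# and column name to whatever your release actually contains.
import csv
from collections import Counter

def duplicate_stats(tsv_path="validated.tsv", text_column="sentence"):
    with open(tsv_path, newline="", encoding="utf-8") as f:
        sentences = [row[text_column] for row in csv.DictReader(f, delimiter="\t")]
    counts = Counter(sentences)
    duplicated_clips = sum(n for n in counts.values() if n > 1)
    print(f"{len(sentences)} clips, {len(counts)} unique sentences, "
          f"{duplicated_clips} clips whose sentence appears more than once")

if __name__ == "__main__":
    duplicate_stats()
```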
I assumed the full set from each corpus would have been used, but I guess that isn't the case, since CommonVoice, for example, contains duplicates, and presumably some percentage is held out for development/testing purposes?
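For reference, one common way to do that hold-out is a simple percentage split over the clips; here is a rough sketch of the idea. The 80/10/10 ratio and the random shuffling are assumptions for illustration, not necessarily how this release was actually split.

```python
# Rough sketch of a percentage-based train/dev/test split over a list of clips.
# The 80/10/10 ratio and random shuffling are illustrative assumptions only.
import random

def split_clips(clips, dev_frac=0.10, test_frac=0.10, seed=42):
    clips = list(clips)
    random.Random(seed).shuffle(clips)
    n_dev = int(len(clips) * dev_frac)
    n_test = int(len(clips) * test_frac)
    dev = clips[:n_dev]
    test = clips[n_dev:n_dev + n_test]
    train = clips[n_dev + n_test:]
    return train, dev, test

train, dev, test = split_clips(f"clip_{i}.mp3" for i in range(1000))
print(len(train), len(dev), len(test))  # 800 100 100
```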