0.6 model questions

Congratulations on getting 0.6 out! I had some questions about the pre-trained model that weren’t mentioned in the release notes.

  1. How many hours of data were used to train the model in total?

  2. How many hours of that were Common Voice data?

  3. Was the model trained with augmentation switched on?

  1. 3816 hours.

  2. The entire training set of the “en_1087h_2019-06-12” release was used: 88 hours 30 minutes in total.

  3. No. The training parameters are specified in the release notes; any parameters not mentioned there were set to default values, and online augmentation is disabled by default.


Thanks! Was there a particular reason augmentation wasn’t switched on, or was it just because it would have taken longer?

In our current cluster setup, augmented runs take over 20x longer to train than a non-augmented run. That plus the need to tune the hyperparameters for our case, plus wanting to get 0.6.0 out soon, meant we chose not to experiment with it for now.

Congratulations! Seems like it brings a huge number of improvements, I look forward to trying it out further.

My initial tests found it quite a bit faster, and the trained model copes reasonably well with my British English (despite the US English training).

Just out of curiosity, was that vast amount of NPR training data prepared using the DS Align project? (sorry for my earlier typo with the name!)

Yes, it was aligned using DSAlign.


@reuben Is there any particular reason the amount of training data used (3816h) is less than the corpora should provide? The Fisher, LibriSpeech, Common Voice English, and Switchboard corpora on which the model was trained should sum to 5069h.

Why the jump from 1087h to 88 hours?

I don’t know how you got to that number; I calculated 3816h directly from the training CSVs.
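The total quoted here can be reproduced from the CSVs themselves. Below is a minimal sketch of that calculation, assuming the DeepSpeech CSV layout (`wav_filename`, `wav_filesize`, `transcript` columns) and 16 kHz, 16-bit, mono PCM audio; the sample CSV contents are made up for illustration:

```python
import csv
import io

# 16 kHz sample rate * 2 bytes per sample, mono (WAV header overhead ignored)
BYTES_PER_SECOND = 16000 * 2

def total_hours(csv_file):
    """Sum audio durations from a DeepSpeech-style CSV via its wav_filesize column."""
    total_bytes = sum(int(row["wav_filesize"]) for row in csv.DictReader(csv_file))
    return total_bytes / BYTES_PER_SECOND / 3600

# Hypothetical two-row CSV standing in for a real training CSV
sample = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "a.wav,320000,hello\n"   # 10 s of audio
    "b.wav,640000,world\n"   # 20 s of audio
)
print(round(total_hours(sample) * 3600))  # 30 seconds total
```

To check a real release you would pass the actual `train.csv` files instead of the in-memory sample.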

Because the majority of the data in the current CV release consists of duplicate sentences. The English text corpus was very small until earlier this year, when Sentence Collector launched (and the 1M Wikipedia sentences were added later on), so future dataset releases should have a better ratio of duplicates to non-duplicates.
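The duplicate share can be measured directly from a Common Voice release TSV. A rough sketch, assuming only that the file is tab-separated with a `sentence` column (the miniature sample below is invented for illustration):

```python
import csv
import io
from collections import Counter

def duplicate_ratio(tsv_file):
    """Fraction of clips whose sentence text also appears in other clips."""
    counts = Counter(row["sentence"] for row in csv.DictReader(tsv_file, delimiter="\t"))
    total = sum(counts.values())
    duplicated = sum(c for c in counts.values() if c > 1)
    return duplicated / total

# Hypothetical miniature validated.tsv: 3 of 4 clips share a sentence
sample = io.StringIO(
    "client_id\tsentence\n"
    "a\thello world\n"
    "b\thello world\n"
    "c\thello world\n"
    "d\tunique line\n"
)
print(duplicate_ratio(sample))  # 0.75
```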

@dabinat Thanks! That’s very interesting to know.

It seems I was looking not at the training data itself but rather at the published size of each corpus:

CommonVoice: 1087h, LibriSpeech: 1000h, Fisher: 2742h, Switchboard: 240h

I assumed the full set from each of them would have been used, but I guess that is not the case, since CommonVoice, for example, has duplicates, and presumably some percentage is held out for development/testing purposes?
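The arithmetic behind the question can be checked directly; this snippet just reproduces the corpus figures quoted above:

```python
# Published corpus sizes quoted above, in hours
corpora = {"CommonVoice": 1087, "LibriSpeech": 1000, "Fisher": 2742, "Switchboard": 240}

nominal_total = sum(corpora.values())
print(nominal_total)          # 5069 hours if every corpus were used in full
print(nominal_total - 3816)   # 1253 hours unaccounted for vs. the stated 3816h
```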

Does this apply to 0.7.0 as well, i.e. the same 3816h?

Also, was Switchboard Cellular Part 1 used in addition to Switchboard-1?

Is it possible to know how much of Fisher was used? It seems like that corpus is conversational and most closely related to the kind of data I am looking at.

Is it possible to get the training CSV that you used?