0.6 model questions

Congratulations on getting 0.6 out! I had some questions about the pre-trained model that weren’t mentioned in the release notes.

  1. How many hours of data were used to train the model in total?

  2. How many hours of that were Common Voice data?

  3. Was the model trained with augmentation switched on?

  1. 3816 hours.

  2. The entire training set of the “en_1087h_2019-06-12” release was used: 88 hours 30 minutes in total.

  3. No. The training parameters are specified in the release notes; any parameters not mentioned there were set to default values, and online augmentation is disabled by default.


Thanks! Was there a particular reason augmentation wasn’t switched on, or was it just because it would have taken longer?

In our current cluster setup, augmented runs take over 20x longer to train than a non-augmented run. That plus the need to tune the hyperparameters for our case, plus wanting to get 0.6.0 out soon, meant we chose not to experiment with it for now.

Congratulations! Seems like it brings a huge number of improvements, I look forward to trying it out further.

My initial tests found it quite a bit faster, and the trained model copes reasonably well with my British English (despite the US English training).

Just out of curiosity, was that vast amount of NPR training data prepared using the DS Align project? (sorry for my earlier typo with the name!)

Yes, it was aligned using DSAlign.


@reuben Is there any particular reason the amount of training data used (3816h) is less than the corpora should provide? The Fisher, LibriSpeech, Common Voice English, and Switchboard corpora on which the model was trained should sum to 5069h.

Why the jump from 1087h to 88 hours?

I don’t know how you got to that number; I calculated 3816h directly from the training CSVs.
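The total quoted here can be reproduced from the CSVs themselves. Below is a minimal sketch of that calculation, assuming the DeepSpeech CSV layout (`wav_filename`, `wav_filesize`, `transcript` columns) and 16 kHz, 16-bit, mono PCM audio; the sample CSV contents are made up for illustration:

```python
import csv
import io

# 16 kHz sample rate * 2 bytes per sample, mono (WAV header overhead ignored)
BYTES_PER_SECOND = 16000 * 2

def total_hours(csv_file):
    """Sum audio durations from a DeepSpeech-style CSV via its wav_filesize column."""
    total_bytes = sum(int(row["wav_filesize"]) for row in csv.DictReader(csv_file))
    return total_bytes / BYTES_PER_SECOND / 3600

# Hypothetical two-row CSV standing in for a real training CSV
sample = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "a.wav,320000,hello\n"   # 10 s of audio
    "b.wav,640000,world\n"   # 20 s of audio
)
print(round(total_hours(sample) * 3600))  # 30 seconds total
```

To check a real release you would pass the actual `train.csv` files instead of the in-memory sample.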

Because the majority of the data in the current CV release consists of duplicate sentences. The English text corpus was very small until earlier this year, when Sentence Collector launched (and the 1M Wikipedia sentences were added later on), so future dataset releases should have a better ratio of duplicates to non-duplicates.
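The duplicate share can be measured directly from a Common Voice release TSV. A rough sketch, assuming only that the file is tab-separated with a `sentence` column (the miniature sample below is invented for illustration):

```python
import csv
import io
from collections import Counter

def duplicate_ratio(tsv_file):
    """Fraction of clips whose sentence text also appears in other clips."""
    counts = Counter(row["sentence"] for row in csv.DictReader(tsv_file, delimiter="\t"))
    total = sum(counts.values())
    duplicated = sum(c for c in counts.values() if c > 1)
    return duplicated / total

# Hypothetical miniature validated.tsv: 3 of 4 clips share a sentence
sample = io.StringIO(
    "client_id\tsentence\n"
    "a\thello world\n"
    "b\thello world\n"
    "c\thello world\n"
    "d\tunique line\n"
)
print(duplicate_ratio(sample))  # 0.75
```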

@dabinat Thanks! That’s very interesting to know.

It seems I was looking not at the training data itself but rather at the published size of each corpus:

CommonVoice: 1087h, LibriSpeech: 1000h, Fisher: 2742h, Switchboard: 240h

I assumed the full set from each of them would have been used, but I guess that is not the case, since CommonVoice, for example, has duplicates, and presumably some percentage is held out for development/testing purposes?
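The arithmetic behind the question can be checked directly; this snippet just reproduces the corpus figures quoted above:

```python
# Published corpus sizes quoted above, in hours
corpora = {"CommonVoice": 1087, "LibriSpeech": 1000, "Fisher": 2742, "Switchboard": 240}

nominal_total = sum(corpora.values())
print(nominal_total)          # 5069 hours if every corpus were used in full
print(nominal_total - 3816)   # 1253 hours unaccounted for vs. the stated 3816h
```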

Does this apply to 0.7.0 as well, i.e. the same 3816h?

Also, was Switchboard Cellular Part 1 used in addition to Switchboard-1?

Is it possible to know how much of Fisher was used? It seems like that corpus is conversational and most closely related to the kind of data I am looking at.

Is it possible to get the training CSV that you used?