New transcribed dataset: MLS

This new transcribed dataset looks like it could be very helpful. I don’t know how feasible (or desirable) it would be to incorporate it into training for the next release of the main English model, but with 44.5k hrs transcribed it is far larger than the earlier LibriSpeech dataset (1,000 hrs).

It also has transcribed audio for other languages. Those quantities are less dramatic, but they could still be a big help compared with what’s currently available for those languages.


Awesome, thank you for the link!

Hi, we have implemented an MLS importer for the Italian speech dataset.

The work is not yet finalized and we would like to try various training tests.
MLS has audio clips ranging from 10 to 20 seconds.


Please share that as a PR as soon as you can!

The MLS importer, like all the importers we have written recently, depends on our utility for common operations (corpora_importer.py).
We also have a collector that generates the final speech dataset by aggregating all the imported corpora.
The DeepSpeech EN repo uses a different strategy.
Should I also open a PR for the utilities?

In any case, the work on our corpora_collector has yet to be completed, hopefully soon.
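
For readers unfamiliar with the idea, here is a rough sketch of what aggregating imported corpora could look like. It is not the actual corpora_collector code: the function name `collect_corpora` is hypothetical, and it assumes each importer has already written a CSV with the `wav_filename`, `wav_filesize`, `transcript` columns that DeepSpeech training expects.

```python
import csv

# Column layout used by DeepSpeech training CSVs.
FIELDNAMES = ["wav_filename", "wav_filesize", "transcript"]


def collect_corpora(csv_paths, output_path):
    """Merge several imported-corpus CSVs into one file.

    Hypothetical helper: clips whose wav_filename was already seen
    are skipped, so overlapping corpora do not produce duplicates.
    Returns the number of rows written.
    """
    seen = set()
    rows_written = 0
    with open(output_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
        writer.writeheader()
        for path in csv_paths:
            with open(path, newline="", encoding="utf-8") as src:
                for row in csv.DictReader(src):
                    key = row["wav_filename"]
                    if key in seen:
                        continue
                    seen.add(key)
                    # Keep only the expected columns, in a fixed order.
                    writer.writerow({name: row[name] for name in FIELDNAMES})
                    rows_written += 1
    return rows_written
```

Calling `collect_corpora(["mls_it.csv", "other_corpus.csv"], "all_it.csv")` (file names are placeholders) would write one combined CSV. A real collector would probably also filter by clip duration and split the result into train/dev/test sets.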

That’s a good question. Maybe it would be useful to move to that factorized code as well. What do you think, @reuben?