nmstoker (Neil Stoker) | December 16, 2020, 1:02am | #1
This new transcribed dataset looks like it could be very helpful. I don’t know how feasible (or desirable) it would be to incorporate it into training for the next release of the main English model, but with 44.5k hours of transcribed English it compares very well with the earlier LibriSpeech dataset (1,000 hours).
This paper introduces Multilingual LibriSpeech (MLS) dataset, a large
multilingual corpus suitable for speech research. The dataset is derived from
read audiobooks from LibriVox and consists of 8 languages, including about
44.5K hours of English and...
It also has transcribed audio for the other languages; those quantities are less dramatic, but they could still be a big help compared with what’s currently available for those languages.
Awesome, thank you for the link!
Hi, we have implemented an MLS importer for the Italian speech dataset.
#!/usr/bin/env python3
import time
import os
import re

from corpora_importer import ArchiveImporter, Corpus

CORPUS_NAME = 'mls'

# Functions fix_apostrophe and load_mailabs_fixed_token:
# see https://github.com/MozillaItalia/DeepSpeech-Italian-Model/issues/124
def fix_apostrophe(text_normalized, fixed_token):
    # Restore apostrophes using the normalized -> apostrophised token map.
    tokens = text_normalized.split()
    for tok in tokens:
        if tok in fixed_token:
            # replace() is only safe when there are no ambiguous tokens,
            # e.g. destro -> d'estro, sera -> s'era
            text_normalized = text_normalized.replace(tok, fixed_token[tok])
    return text_normalized  # assumed return; the original file is truncated below
This file has been truncated; see the original on GitHub.
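For illustration, here is a minimal usage sketch of fix_apostrophe as defined above. The fixed_token map below is a hypothetical example in the spirit of the mapping discussed in issue #124, not the actual output of load_mailabs_fixed_token:

# Hypothetical token map: normalized form -> apostrophised form
fixed_token = {"lacqua": "l'acqua", "unora": "un'ora"}

text = "lacqua bolle da unora"
print(fix_apostrophe(text, fixed_token))
# -> "l'acqua bolle da un'ora"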
The work is not yet finalized; we would like to run various training experiments.
MLS has audio clips ranging from 10 to 20 seconds.
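As a sanity check on those lengths, a minimal sketch (my own, not part of the importer) that measures clip durations with the standard-library wave module and flags anything outside the 10 to 20 second range; the mls_italian/audio path and the flat layout of converted .wav files are assumptions:

import wave
from pathlib import Path

AUDIO_DIR = Path("mls_italian/audio")  # assumed location of converted .wav clips

for wav_path in sorted(AUDIO_DIR.glob("*.wav")):
    with wave.open(str(wav_path), "rb") as wav:
        duration = wav.getnframes() / wav.getframerate()
    if not 10.0 <= duration <= 20.0:
        print(f"{wav_path.name}: {duration:.1f}s outside expected 10-20s range")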
GitHub issue (opened 17 Dec 2020, 06:09 PM UTC; labels: help wanted, dataset):
## LIST OF ALL ITALIAN DATASETS FOUND
From issue #90 I'm putting here all the datasets that have been discovered.
Some of them are plug-and-play for DeepSpeech; others need to be created from scratch (splitting the audio up by sentence).
Feel free to pick up one that has not been done yet and check it out.
### NOTE
If one of these datasets needs deeper analysis, please do not start a discussion here; open a new issue instead and I will update this table with the issue reference.
## DATASETS
| dataset | hrs | url | plug-n-play | TODOs | doing | done | note |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
|**MLS** | 279.43 h | [↗](http://openslr.org/94/)| | | | | **HOT!!!!**
|VoxForge #111 | 20h | [↗](http://www.repository.voxforge1.org/downloads/it/Trunk/Audio/Main/16kHz_16bit/)| ✔ | <ul><li>- [x] url replace in DS import_voxforge.py script</li><li>- [x] fix import sys error </li></ul> | ✔ | |
|MAILABS | 127h40m | [↗](https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/)| ✔ | | | ✔ |
|Evalita2009 | 5h | [↗](http://www.evalita.it/2009/tasks/digits)| | | | ✔ |
|MSPKA | 3h | [↗](http://www.mspkacorpus.it/)| | | | ✔ |
|SIWIS | 4.5h | [↗](https://phonogenres.unige.ch/index.php?page=téléchargement)| | | | ✔ |
|SUGAR | 1.5h | [↗](https://github.com/evalitaunina/SUGAR_Corpus)| | | | | sentences are not useful
|VociParlateWikipedia #34 | ? | [↗](https://it.wikipedia.org/wiki/Categoria:Voci_parlate)| |<ul><li>- [ ] sync audio with its page revision</li></ul> | | |
|EMOVO | ~12m | [↗](http://voice.fub.it/activities/corpora/emovo/index.html)| |<ul><li>- [ ] align filename codes with their sentences </li></ul> | | | interesting for emotions (disgust, happy..)
|ZIta | <1hr | [↗](https://github.com/ChMeluzzi/ZIta)| | | | | transcriptions do not follow recordings (eg: Lett_Z_Sp1_zero.wav)
|LIM_Veneti | <1hr | [↗](https://github.com/ChMeluzzi/LIM_Veneti)| | | | | no audio files?
|split-MDb | ~46m | [↗](http://www.parlaritaliano.it/index.php/en/corpora/644-spit-mdb-spoken-italian-multilevel-database)| |<ul><li>- [ ] parse&clean the .wrd files </li></ul> | | | based on CLIPS
|tg60 | 1h30m | [↗](http://www.parlaritaliano.it/index.php/it/dati/650-corpus-di-parlato-telegiornalistico-anni-sessanta-vs-2005)| |<ul><li>- [ ] long audio files to be split </li></ul> | | | maybe among the info files there are some timings that could be useful for splitting up? (see the splitting sketch after this table)
|PraTiD | 1h12m | [↗](http://www.parlaritaliano.it/index.php/en/corpora/645-corpus-pratid)| |<ul><li>- [ ] long audio files to be split </li></ul> | | | From CLIPS; maybe among the info files there are some timings that could be useful for splitting up?
|ParlatoCinematografico | ? | [↗](http://www.parlaritaliano.it/index.php/it/dati/659-corpus-di-parlato-cinematografico)| |<ul><li>- [ ] long audio files to be split </li></ul> | | | .lab files with speakers timings
|PerugiaCorpusPEC | ? | [↗](https://www.unistrapg.it/cqpwebnew/)| | | | | a login is needed. License?
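For the datasets above whose TODO is "long audio files to be split" (tg60, PraTiD, ParlatoCinematografico), here is a rough sketch of the kind of segmentation that could work once per-segment timings are recovered from the info/.lab files. The segments list and file names below are hypothetical, and it assumes the source audio is already WAV:

import wave

# Hypothetical (start, end, transcript) timings recovered from an info/.lab file
segments = [(0.0, 4.2, "prima frase"), (4.2, 9.8, "seconda frase")]

with wave.open("tg60_full.wav", "rb") as src:
    rate = src.getframerate()
    params = src.getparams()
    for i, (start, end, text) in enumerate(segments):
        src.setpos(int(start * rate))
        frames = src.readframes(int((end - start) * rate))
        with wave.open(f"tg60_{i:04d}.wav", "wb") as dst:
            dst.setparams(params)  # nframes is corrected on close
            dst.writeframes(frames)
        print(f"tg60_{i:04d}.wav\t{text}")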
lissyx ((slow to reply) [NOT PROVIDING SUPPORT]) | January 18, 2021, 11:40am | #4
Please share that as a PR as soon as you can!
MLS, like all the importers we have been writing recently, uses our utility module for common operations (corpora_importer.py), so it depends on that utility.
Then we have a collector that aggregates all the imported corpora into a final speech dataset.
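To make that split of responsibilities concrete, a rough sketch of what such a collector could look like; the CSV columns follow the DeepSpeech convention (wav_filename, wav_filesize, transcript), but the function, paths, and layout are my own assumptions, not the actual corpora_collector code:

import csv
from pathlib import Path

def collect(corpus_csvs, out_csv):
    # Aggregate the per-corpus CSVs produced by each importer into one dataset file.
    fieldnames = ["wav_filename", "wav_filesize", "transcript"]
    with open(out_csv, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames)
        writer.writeheader()
        for corpus_csv in corpus_csvs:
            with open(corpus_csv, newline="", encoding="utf-8") as f:
                for row in csv.DictReader(f):
                    writer.writerow({k: row[k] for k in fieldnames})

collect(sorted(Path("corpora").glob("*/train.csv")), "train_all.csv")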
The DeepSpeech EN repo uses a different strategy.
Should I also do a PR of the utilities?
In any case, the work on our corpora_collector is not yet complete; hopefully it will be soon.
lissyx ((slow to reply) [NOT PROVIDING SUPPORT]) | January 18, 2021, 2:50pm | #6
That’s a good question. Maybe it would be useful to move to that factorized code as well. What do you think, @reuben?