Hello, here’s another contribution.
It is an automatically validated dataset containing 120h of clean Spanish speech from LibriVox.
To download the dataset, go to: https://www.kaggle.com/carlfm01/120h-spanish-speech/
Motivation to release it
120h is almost nothing compared to the amount of data required to train a general-purpose model, but for Spanish there are almost no public datasets to train or even test on, other than VoxForge. The main goal of this dataset is to let people share insights around a common set.
The issue with no test set for Spanish:
I’ll copy the methodology from the GitHub project:
I automatically aligned the text to the audio with Windows Speech Recognition, then validated the alignment with Mozilla’s DeepSpeech, using a few different language models. The first model used to start the validation was trained on the VoxForge Spanish data plus the clips that scored the highest confidence with Windows Speech Recognition.
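For anyone who wants to try a similar validation step, here is a minimal sketch of the idea, not the code used for this dataset. It assumes you already have, for each clip, the aligned text and the transcript returned by a second recognizer (for example a DeepSpeech model). The post does not say how agreement was measured, so this sketch swaps in word error rate; `word_error_rate`, `validate_clips` and the `max_wer` threshold are hypothetical names and values chosen for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words (rolling-row Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def validate_clips(clips, max_wer=0.1):
    """Keep the ids of clips whose two transcripts agree closely enough.

    clips: iterable of (clip_id, aligned_text, recognized_text) tuples.
    max_wer: assumed threshold; tune it for your own data.
    """
    return [clip_id
            for clip_id, aligned_text, recognized_text in clips
            if word_error_rate(aligned_text, recognized_text) <= max_wer]
```

A stricter threshold keeps fewer but cleaner clips, so it is a trade-off between dataset size and label quality.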
I saw a few people suggest using DeepSpeech models to do the validation; well, here’s the result of that idea.
This is my first dataset release, so feedback is much appreciated.
This will probably interest @daniel.cruzado @nukeador and @mar_martinez