Releasing my Spanish dataset - 120h of public domain data

Hello, here’s another contribution.

It is my automatically validated dataset, containing 120h of clean Spanish speech from LibriVox.

To download the dataset, go to: https://www.kaggle.com/carlfm01/120h-spanish-speech/

Motivation to release it
120h is almost nothing compared to the amount of data required to train a general-purpose model, but for Spanish there are almost no public datasets to train on, or even to test on, other than VoxForge. The main goal of this dataset is to share insights around a common set.

The issue with no test set for Spanish:

I’ll copy the methodology from the GitHub project:

The text was automatically aligned with Windows speech recognition. Then, to validate the alignment, I used a Mozilla DeepSpeech model with a few different language models. The first model used to start the validation was trained on VoxForge Spanish data plus the clips that scored the highest confidence with Windows Speech.
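The validation step can be sketched roughly as follows. This is only an illustration, not the code from the GitHub project: it assumes agreement between the aligned text and the validator model's transcript is scored with word error rate, and the function names, the `max_wer` threshold of 0.1 and the sample clips are all hypothetical. The transcripts from the second model are taken as plain strings, so no DeepSpeech call is shown.

```python
import re
import unicodedata


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and digits, and split into words."""
    text = unicodedata.normalize("NFC", text.lower())
    # Letters only; Unicode-aware, so accented vowels and "ñ" are kept.
    return re.findall(r"[^\W\d_]+", text)


def word_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(reference)


def validate_alignments(clips, max_wer=0.1):
    """Return ids of clips whose aligned text agrees with the validator.

    `clips` holds (clip_id, aligned_text, validator_text) tuples.
    `max_wer` is an assumed threshold, not a value from the post.
    """
    kept = []
    for clip_id, aligned_text, validator_text in clips:
        wer = word_error_rate(normalize(aligned_text), normalize(validator_text))
        if wer <= max_wer:
            kept.append(clip_id)
    return kept


# Hypothetical sample: the second clip disagrees too much and is dropped.
clips = [
    ("clip_001", "En un lugar de la Mancha,", "en un lugar de la mancha"),
    ("clip_002", "de cuyo nombre no quiero acordarme", "de cuyo hombre no quiero"),
]
kept_ids = validate_alignments(clips)
```

A stricter threshold keeps fewer but cleaner clips, while a looser one keeps more audio at the cost of more misaligned text.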

I saw a few people suggest using DeepSpeech models to do the validation; well, here’s the result of that idea.

This is my first dataset release, so feedback is much appreciated :slight_smile:

This will probably interest @daniel.cruzado @nukeador and @mar_martinez


Thanks a lot!

It will be very useful.

I have also converted some LibriVox audiobooks, matching them with Gutenberg texts (all of them free to use). I will upload the audio to GitHub if it does not duplicate yours, and share the link soon.

Best regards,
Mar