Releasing my Spanish dataset - 120h of public domain data

carlfm01 · June 9, 2019, 9:01am

Hello, here’s another contribution.

This is my automatically validated dataset, which contains 120h of clean Spanish speech from Librivox.

To download the dataset, go to: 120h Spanish Speech | Kaggle

Motivation to release it
120h is almost nothing compared to the amount of data required to train a general-purpose model, but for Spanish there are almost no public datasets to train or even test on other than voxforge. The main goal of this dataset is to share insights around a common set.

The issue with no test set for Spanish:

I’ll copy the methodology from the GitHub project:

I automatically aligned the text with Windows Speech Recognition, then validated the alignment with Mozilla’s DeepSpeech models, using a few different language models. The first model used to start the validation was trained on the voxforge Spanish data plus the clips that scored the highest confidence with Windows Speech Recognition.
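
As a rough illustration of that validation idea (not the actual pipeline code), the sketch below compares the text assigned to each clip by the aligner with the transcript from a second recognizer and keeps only the clips where the two agree closely. The word error rate metric, the `max_wer` threshold, and the `aligned_text` / `asr_text` field names are all assumptions made for the example; the transcripts are assumed to have been produced already.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, drop punctuation (including ¿ and ¡), and split into words."""
    text = unicodedata.normalize("NFC", text.lower())
    text = re.sub(r"[^\w\s]", " ", text)
    return text.split()


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def validate_clips(clips, max_wer=0.1):
    """Keep clips whose aligned text agrees with the second recognizer.

    max_wer is a hypothetical threshold, not a value from the dataset.
    """
    return [
        clip for clip in clips
        if word_error_rate(clip["aligned_text"], clip["asr_text"]) <= max_wer
    ]


if __name__ == "__main__":
    sample = [
        {"id": "a", "aligned_text": "¿Dónde está la biblioteca?",
         "asr_text": "dónde está la biblioteca"},
        {"id": "b", "aligned_text": "el perro come",
         "asr_text": "la casa azul grande"},
    ]
    print([clip["id"] for clip in validate_clips(sample)])
```

A stricter threshold keeps fewer but cleaner clips; a looser one keeps more data at the cost of more misaligned text.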

I saw a few people suggest using DeepSpeech models to do the validation; well, here’s the result of that idea.

This is my first dataset release, so feedback is much appreciated.

This will probably interest @daniel.cruzado @nukeador and @mar_martinez

mar_martinez · June 10, 2019, 5:25pm

Thanks a lot!

It will be very useful.

I have also converted some Librivox audiobooks, matching them with Gutenberg texts (all of them free to use). I will upload the audio to GitHub if it does not duplicate yours, and share the link soon.

Best regards,
Mar
