Sharing my 100h single-speaker dataset (Spanish)

Hello, I just want to share the data I’m currently using. It contains 50h of reviewed speech and 50h of speech that is aligned but not yet reviewed. For the review I’m using a DeepSpeech model: where the transcription matches the DS prediction, I mark the clip as valid.
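For anyone who wants to try the same kind of review, here is a minimal sketch of the matching step. It assumes the model's prediction is already available as a string, and it assumes both texts are normalized (lowercased, punctuation stripped, accents kept) before an exact comparison. The normalization rules and the function names `normalize` and `is_valid` are illustrative assumptions, not necessarily what was used for this dataset.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces and collapse whitespace.

    Accented letters and ñ are kept, since they change the word in Spanish.
    """
    text = unicodedata.normalize("NFC", text).lower()
    # \w is Unicode-aware in Python 3, so á, é, ñ, ü count as word characters.
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def is_valid(transcript: str, prediction: str) -> bool:
    """Mark a clip as valid when transcript and prediction match after normalization."""
    return normalize(transcript) == normalize(prediction)


# Hypothetical example pair: reference transcript vs. speech-to-text output.
print(is_valid("¡Hola, mundo!", "hola mundo"))
```

An exact match is strict, so it rejects clips that are actually fine whenever the model makes a small mistake. That is why unmatched clips are better treated as "not reviewed" than as "bad".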

Most of the time we have a limited amount of data to train on (in this case for Spanish). The idea of this dataset is to use it as a base and adapt it to a new voice with far less data.

Using LJSpeech format!
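In case the format is unfamiliar: the original LJSpeech release ships a `wavs/` folder plus a pipe-delimited `metadata.csv` with one line per clip (clip id, transcription, normalized transcription). Below is a rough sketch of a reader for that layout. The clip ids and sentences in the sample are made up, and whether this dataset fills the third column is an assumption, so the code falls back to the raw transcription when it is missing.

```python
import csv
import io
from typing import Iterable, List, NamedTuple


class Utterance(NamedTuple):
    clip_id: str          # file name without extension, i.e. wavs/<clip_id>.wav
    text: str             # transcription as written
    normalized_text: str  # transcription with numbers etc. spelled out


def read_metadata(lines: Iterable[str]) -> List[Utterance]:
    """Parse LJSpeech-style metadata: id|text|normalized text, one clip per line."""
    # QUOTE_NONE: quotation marks inside a sentence are ordinary characters.
    reader = csv.reader(lines, delimiter="|", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) < 2:
            raise ValueError(f"expected at least 'id|text', got: {row!r}")
        normalized = row[2] if len(row) > 2 else row[1]
        rows.append(Utterance(row[0], row[1], normalized))
    return rows


# Made-up sample standing in for metadata.csv.
sample = io.StringIO(
    "clip_0001|El año 1808.|El año mil ochocientos ocho.\n"
    "clip_0002|Hola.\n"
)
for utt in read_metadata(sample):
    print(utt.clip_id, "->", utt.normalized_text)
```

With a real file, pass `open("metadata.csv", encoding="utf-8", newline="")` instead of the `StringIO` sample.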

Enjoy and please share any feedback.

Hi, can you please indicate where these audios come from? I’d like to avoid duplicating audios in my own dataset. Thanks!

Hello, from https://librivox.org/reader/3946

Here’s the list of books:

12 mayo
a la granja
bailen
cadiz
capitan veneno
carlos cuart
diablo cojuelo
el doncel de don enrique el doliente
escudero
gerona
historia de heródoto
juan martin
la batalla arapiles
la eneida
luchana
mendizábal
montes de oca
napoleón en chamartín
naufragio
señor de bembimbre
trafalgar
vergara
zumalacárregui


@carlfm01 It’s great that you’re sharing your hard work. I can also add your link to the main repo, if you like.

Yes, feel free to link it, or I can send a PR myself.

Either way is fine. What is your GitHub username?

carlfm01