Sharing my 100h of single speaker (Spanish)

carlfm01 · October 10, 2019, 3:38am

Hello, I just want to share the data that I’m currently using. It contains 50h of reviewed speech and 50h of aligned but not yet reviewed speech. To review it I’m using a DeepSpeech model: where the transcription matches the DeepSpeech (DS) prediction, I mark the clip as valid.
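A rough sketch of that review step, for anyone who wants to do something similar. The normalization (lowercasing, dropping punctuation) is an assumption, not necessarily what was done for this dataset, and `transcribe` is a hypothetical callable standing in for whatever speech-to-text model you run:

```python
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    # Keep letters (accented ones included), digits and whitespace only.
    text = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)
    return " ".join(text.split())


def is_valid(transcript: str, prediction: str) -> bool:
    """A clip counts as reviewed when both texts match after normalization."""
    return normalize(transcript) == normalize(prediction)


def split_reviewed(items, transcribe):
    """Split (wav_path, transcript) pairs into reviewed and unreviewed lists.

    `transcribe` is a hypothetical function: wav_path -> predicted text.
    """
    reviewed, unreviewed = [], []
    for wav_path, transcript in items:
        if is_valid(transcript, transcribe(wav_path)):
            reviewed.append((wav_path, transcript))
        else:
            unreviewed.append((wav_path, transcript))
    return reviewed, unreviewed
```

An exact match is a strict filter: it will reject some correct transcriptions whenever the model makes a mistake, so the unreviewed half is not necessarily wrong, just unconfirmed.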

Most of the time we have a limited amount of data to train on (in this case for Spanish). The idea of this dataset is to use it as a base and try to adapt it to a new voice with far less data.

Using LJSpeech format!
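For reference, a minimal sketch of reading an LJSpeech-style layout, assuming the usual convention of a pipe-separated `metadata.csv` (clip id, transcription, optional normalized transcription) next to a `wavs/` folder. The function name is made up for illustration, and this particular dataset may differ in the details:

```python
import csv
from pathlib import Path


def load_ljspeech_metadata(root):
    """Return a list of (wav_path, text, normalized_text) tuples."""
    root = Path(root)
    entries = []
    with open(root / "metadata.csv", encoding="utf-8", newline="") as f:
        # QUOTE_NONE: quotation marks inside a transcription are plain text.
        for row in csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE):
            if len(row) < 2:
                continue  # skip blank or malformed lines
            clip_id, text = row[0], row[1]
            # Fall back to the raw text when there is no third column.
            normalized = row[2] if len(row) > 2 else text
            entries.append((root / "wavs" / f"{clip_id}.wav", text, normalized))
    return entries
```

Calling `load_ljspeech_metadata("path/to/dataset")` gives you the wav paths and texts in file order; the wav paths are built from the ids and are not checked for existence.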

Enjoy and please share any feedback.

reyxuan · September 19, 2019, 6:25am

Hi, can you please indicate where these audios come from, so that I don’t duplicate audio in my dataset? Thanks!

carlfm01 · September 19, 2019, 6:45am

Hello, from https://librivox.org/reader/3946

Here’s the list of books (ignore the txt)

12 mayo txt
a la granja txt
bailen txt
cadiz txt
capitan veneno txt
carlos cuart txt
diablo cojuelo txt
el doncel de don enrique el doliente txt
escudero txt
gerona txt
historia de heródoto txt
juan martin txt
la batalla arapiles txt
la eneida txt
luchana txt
mendizábal txt
montes de oca txt
napoleón en chamartín txt
naufragio txt
señor de bembimbre txt
trafalgar txt
vergara txt
zumalacárregui txt

erogol · September 20, 2019, 7:06am

@carlfm01 It’s great that you’re sharing your hard work. I can also add your link to the main repo, if you like.

carlfm01 · September 20, 2019, 7:32am

Yes, feel free to link it, or I can send a PR.

erogol · September 20, 2019, 10:36am

Either way is fine. What is your GitHub name?

carlfm01 · September 20, 2019, 12:02pm

carlfm01