First of all…
Cheers and thanks for this amazing project. I have waited nearly four years for something like this.
I can’t explain how grateful I am…
So after I had reached 12% in German, I was determined to reach similar results with other languages as well.
Long story short…
I built some Python scripts to collect, sort, and clean datasets for DeepSpeech.
You can prepare training data with just one command.
I tried to make it as user-friendly and convenient as possible.
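To give a rough idea of what "cleaning" means here, below is a minimal sketch and not the actual code from my scripts. It assumes the usual DeepSpeech training CSV layout with the columns wav_filename, wav_filesize, and transcript. The names ALLOWED, clean_transcript, and build_csv, as well as the German character set, are made up for illustration.

```python
import csv
import io
import re
import unicodedata

# Hypothetical character set for German; a real setup would take this
# from the alphabet file used for training.
ALLOWED = set("abcdefghijklmnopqrstuvwxyzäöüß' ")


def clean_transcript(text):
    """Lowercase, replace characters outside ALLOWED with spaces, collapse whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    text = "".join(ch if ch in ALLOWED else " " for ch in text)
    return re.sub(r"\s+", " ", text).strip()


def build_csv(samples):
    """Build CSV text from (wav_path, size_in_bytes, raw_transcript) tuples."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["wav_filename", "wav_filesize", "transcript"])
    for path, size, raw in samples:
        cleaned = clean_transcript(raw)
        if cleaned:  # skip clips whose transcript is empty after cleaning
            writer.writerow([path, size, cleaned])
    return out.getvalue()
```

Note that this sketch simply drops digits and punctuation. Spelling out numbers as words would be a separate step.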
Any suggestions or questions are welcome. If you have any ideas for future features, let me know, and please share your results and arguments.
If you know of more datasets for the languages below, please share them with me and I will integrate them as well.
Maybe you can write to your government if it is holding back data, as in the Netherlands. They are only damaging themselves…
Datasets so far:
Spoken Wiki (aligner is broken, will be fixed in the next version)
African Accented French
I won’t put the download links for the Common Voice (CV) dataset in the database because of its terms and conditions.
However, I will add an option to insert the links yourself after you have accepted the terms and received the links.
Tests so far:
de = 12% (lost the data and graph)
de uppercase = 18.9%
cs = not enough data
lt = not enough data
da = not enough data
et = not enough data
fi = not enough data
ro = not enough data
sq = not enough data
bg = not enough data
hr = not enough data
el = not enough data
ca = not enough data
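If you want to compare your own results with the numbers above: I am assuming here that the percentages are read as word error rates (the list itself does not name the metric). A minimal sketch of how such a rate can be computed follows. The function name word_error_rate is made up for illustration, and this is not the evaluation code DeepSpeech itself uses.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

The comparison is case sensitive, so "Test" and "test" count as different words. Both transcripts should be normalized the same way before scoring.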