Cheers, and thanks for this amazing project. I've waited nearly four years for something like this.
I can't explain how grateful I am…
So after I had reached 12% in German, I was determined to reach similar results in other languages as well.
Long story short…
I built some Python scripts to collect, sort, and clean datasets for DeepSpeech.
You can prepare training data with just one command.
I tried to make it as user-friendly and convenient as possible.
Any suggestions or questions are welcome. If you have any ideas for future features, let me know, and share your results/arguments.
If you know of more datasets for the languages below, please share them with me; I will integrate them as well.
Maybe you can also write to your government if they are holding back data, like in the Netherlands. They are only damaging themselves…
Datasets so far:
Common Voice
VoxForge
LibriVox
Spoken Wikipedia (aligner is broken; will be fixed in the next version)
Tatoeba
Tuda
Zamia
Vystadial
African Accented French
Nicolas French
I won't put the download links for the Common Voice dataset in the DB because of its terms and conditions.
However, I will create an option to insert the links after you have accepted the terms and received them.
Tests so far (WER):
de = 9.84%
pl = 13.7%
es = 13.9%
it = 18.4%
fr = 22.7%
uk = 29.9%
ru = 36.9%
nl = 39.6%
pt = 50.7%
cs = not enough data
lt = not enough data
da = not enough data
et = not enough data
fi = not enough data
ro = not enough data
sq = not enough data
bg = not enough data
hr = not enough data
el = not enough data
ca = not enough data
Everything is done from scratch.
I really tried to make it as user-friendly as possible.
You only need two commands, plus a third to start the training via the generated training script.
It's a downloader, text crawler/sorter, and audio analyzer/converter combined.
Everything is saved in an SQL database, and you can then create datasets with specific rules and arguments:
for example, a French dataset with only male adult speakers and clip durations between 1 and 15 seconds (see the sketch below).
There's also an option to insert replacement rules.
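To make the rules-and-arguments idea concrete, here is a minimal sketch of such a query against an SQLite database. The table and column names (`clips`, `gender`, `age_group`, `duration`) are guesses for illustration, not the tool's actual schema; only the CSV layout (wav_filename, wav_filesize, transcript) is the standard DeepSpeech import format.

```python
import csv
import os
import sqlite3

# Hypothetical schema: the tool's real table/column names may differ.
conn = sqlite3.connect("datasets.db")
rows = conn.execute(
    """
    SELECT audio_path, transcript
    FROM clips
    WHERE language = 'fr'
      AND gender = 'male'
      AND age_group = 'adult'
      AND duration BETWEEN 1.0 AND 15.0
    """
).fetchall()

# Write the selection as a DeepSpeech-style CSV: wav_filename,wav_filesize,transcript.
with open("train.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["wav_filename", "wav_filesize", "transcript"])
    for path, text in rows:
        writer.writerow([path, os.path.getsize(path), text])
```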
All the results above used default values; I'm pretty sure someone will find better arguments.
Hi, for the Spanish model, did you use VoxForge? If so, did you clean it? I used Windows Speech Recognition to score the confidence of each sentence and then sorted them; I found that the top 80 sentences were totally wrong.
Do you think you could share the validation set? It would be great to play around with it.
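For anyone who wants to try the same filtering, here is a minimal Python sketch of that scoring idea. It swaps the Windows .NET recognizer for the third-party `speech_recognition` package (Google Web Speech backend) and uses a transcript-similarity ratio in place of the engine's confidence value; the file names, transcripts, and language code are made up.

```python
import difflib

import speech_recognition as sr

recognizer = sr.Recognizer()

def score_clip(wav_path: str, expected: str) -> float:
    """Re-transcribe a clip and return how closely the result matches the dataset text."""
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)
    try:
        hypothesis = recognizer.recognize_google(audio, language="es-ES")
    except sr.UnknownValueError:
        return 0.0  # recognizer understood nothing: a very suspicious clip
    return difflib.SequenceMatcher(None, expected.lower(), hypothesis.lower()).ratio()

# Sort (path, transcript) pairs so the least trustworthy clips come first for review.
clips = [("clip_0001.wav", "hola buenos días"), ("clip_0002.wav", "muchas gracias")]
clips.sort(key=lambda c: score_clip(*c))
```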
I used LibriVox/Tatoeba/VoxForge combined.
My tool automatically cleans all the sentences (translating numbers to words; replacing symbols, abbreviations, days, months, currencies, etc.).
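Roughly, the cleaning is rule-based. Here is a stripped-down sketch of the idea; the example mappings below are illustrative only, not the tool's actual per-language rule tables:

```python
import re

# Illustrative English-style rules; the real tables are per-language and much larger.
REPLACEMENTS = {
    "%": " percent",
    "Mr.": "mister",    # abbreviations
    "Jan.": "january",  # months
    "$": " dollars",    # currencies
}
DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three"}  # ...and so on

def clean_sentence(text: str) -> str:
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    # Spell out digits one by one (a real converter handles whole numbers).
    text = re.sub(r"\d", lambda m: " " + DIGIT_WORDS.get(m.group(), "") + " ", text)
    text = text.lower()
    text = re.sub(r"[^a-zà-ÿ' ]", " ", text)  # drop leftover symbols, keep accented letters
    return re.sub(r" +", " ", text).strip()

print(clean_sentence("Mr. Smith won 3% in Jan."))
# -> "mister smith won three percent in january"
```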
Better to play around with my tool…
I'm pretty sure some native speakers can correct the mistakes I made and get even higher scores!
@silenter Thanks for sharing the Spanish model; it's helping me a lot for reviewing transcriptions in combination with Windows speech recognition. Just one thing: I've noticed that the output shows á as á with the Windows client. Which values did you use for the LM?
I think this is the same issue that @roseman mentioned to me: .NET uses a default encoding that causes wrong outputs. It would be great to see your solution, @roseman.
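The symptom (á coming out as á) is classic mojibake: UTF-8 bytes decoded with a legacy Windows code page. Here is a quick Python illustration of that suspected cause; whether it matches the .NET client's actual code path is an assumption:

```python
# "á" encoded as UTF-8 is two bytes; decoding those bytes with a legacy
# Windows code page (cp1252) instead of UTF-8 yields the garbled pair.
utf8_bytes = "á".encode("utf-8")       # b'\xc3\xa1'
garbled = utf8_bytes.decode("cp1252")  # 'á' -- what the Windows client shows
# Re-interpreting the same bytes as UTF-8 restores the original character.
assert garbled.encode("cp1252").decode("utf-8") == "á"
```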
Hi, great work. I'm using the German model for my work, and so far it's doing fine. Any chance of getting the German model for v0.5.0? Or, even better, the checkpoints?