Multilingual Dataset Combiner/Cleaner

(Silenter) #1

First of all…

Cheers, and thanks for this amazing project. I have waited nearly four years for something like this.
I can’t explain how grateful I am…

So after I had reached 12% in German, I was determined to reach similar results with other languages as well.

Long story short…
I built some Python scripts to collect, sort, and clean datasets for DeepSpeech.
You can prepare training data with just one command.

I tried to make it as user-friendly and convenient as possible.

Any suggestions or questions are welcome. If you have any idea for future features, let me know.

And please share your results and the arguments you used.

If you know of more datasets for the languages below, please share them with me and I will integrate them as well.
Maybe you can write to your government if it is holding back data, as in the Netherlands. They are only hurting themselves…

Datasets so far:
Common Voice
Spoken Wiki (the aligner is broken; it will be fixed in the next version)
African Accented French
Nicolas French

I won’t put the download links for the Common Voice dataset in the database because of the terms and conditions.
However, I will add an option to insert the links after you have accepted the terms and received the links.

Tests so far:

de = 12% (data and graph were lost)

de uppercase = 18.9%

pl = 13.7%

es = 13.9%

it = 18.4%

fr = 22.7%

uk = 29.9%

ru = 36.9%

nl = 39.6%

pt = 50.7%

cs = not enough data
lt = not enough data
da = not enough data
et = not enough data
fi = not enough data
ro = not enough data
sq = not enough data
bg = not enough data
hr = not enough data
el = not enough data
ca = not enough data

(Yv) #2

Did you train from scratch or fine-tune existing models?

(Silenter) #3

Everything is done from scratch.
I really tried to make it as user-friendly as possible.
You only need two commands, plus a third to start training via the generated training script.

It’s a downloader, text crawler/sorter, and audio analyzer/converter combined.
Everything is saved in an SQL database, and you can then create datasets with specific rules and arguments.

For example, a French dataset with only male adults and clip durations between 1 and 15 seconds.
There’s also an option to insert replacement rules.
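
To give a rough idea of what such a rule-based selection could look like, here is a minimal sketch using Python's built-in `sqlite3`. The table name `clips`, its columns, and the helper `select_clips` are assumptions made up for illustration; the tool's real schema and options may look quite different.

```python
import sqlite3

# Hypothetical schema: the real tool's tables and columns may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE clips (
           path TEXT, language TEXT, gender TEXT,
           age_group TEXT, duration_sec REAL, transcript TEXT)"""
)
# Made-up rows, for illustration only.
conn.executemany(
    "INSERT INTO clips VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("a.wav", "fr", "male", "adult", 4.2, "bonjour tout le monde"),
        ("b.wav", "fr", "female", "adult", 3.1, "merci beaucoup"),
        ("c.wav", "fr", "male", "adult", 22.0, "une phrase trop longue"),
        ("d.wav", "de", "male", "adult", 5.0, "guten morgen"),
    ],
)


def select_clips(conn, language, gender, age_group, min_sec, max_sec):
    """Return (path, transcript) rows that match every rule."""
    query = """SELECT path, transcript FROM clips
               WHERE language = ? AND gender = ? AND age_group = ?
                 AND duration_sec BETWEEN ? AND ?
               ORDER BY path"""
    return conn.execute(
        query, (language, gender, age_group, min_sec, max_sec)
    ).fetchall()


# French, male adults, clips between 1 and 15 seconds.
french_male_adults = select_clips(conn, "fr", "male", "adult", 1, 15)
print(french_male_adults)
```

Passing the rules as query parameters rather than formatting them into the SQL string keeps the selection safe against odd characters in the values.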

All the results above used default values. I’m pretty sure someone will find better arguments.

(Yv) #4

Great, I’m looking forward to the release.

For the datasets - sizes in hours and perhaps stats about length distribution (min, max, median) might be useful for reference for others.
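
Something along these lines would already be enough, I think. This is only a sketch with the standard library; `duration_stats` is a made-up helper name and the durations are invented, not taken from any of the datasets above.

```python
import statistics


def duration_stats(durations_sec):
    """Summarise clip lengths given in seconds."""
    return {
        "clips": len(durations_sec),
        "total_hours": sum(durations_sec) / 3600,
        "min_sec": min(durations_sec),
        "max_sec": max(durations_sec),
        "median_sec": statistics.median(durations_sec),
    }


# Made-up durations, for illustration only.
stats = duration_stats([2.0, 4.0, 6.0, 1800.0])
print(stats)
```

The median is more telling than the mean here, because a few very long recordings can dominate the average.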

Have you tried any forced alignment in the analyzer/converter?

(Silenter) #5

Yeah, in some early tests with German, but not in any of the results above. Those tests failed, but it had nothing to do with the alignment.

But you have the option to convert/align all WAVs in your database. I just haven’t tried it out yet :stuck_out_tongue:

(Yv) #6

OK, thanks for the info. I’ll check the code once it is published.

(Lissyx) #7

@nicolaspanel contributed some data for French. I’ve already run some tests with Common Voice fr data, and it gave quite nice results:

(Silenter) #8

Added the Nicolas corpora.
Just started training with LibriVox/vox/Nicolas combined.

(Silenter) #9

Okay, I released it anyway. :wink:

(Carlos Fonseca) #10

Hi, for the Spanish model, did you use VoxForge? If so, did you clean it? I used Windows Speech Recognition to score the confidence of each sentence and then sorted them; I found that the top 80 were totally wrong.
Do you think you could share the validation set? It would be great to play around with it.
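
The sorting step itself is simple once the scores exist. A minimal sketch, assuming each clip already has a confidence value from some external recogniser; the file names, the scores, and the helper `lowest_confidence` are invented for illustration:

```python
def lowest_confidence(scored, n):
    """Return the n (path, confidence) pairs with the lowest confidence, worst first."""
    return sorted(scored, key=lambda item: item[1])[:n]


# Made-up (path, confidence) pairs, for illustration only.
scored = [("a.wav", 0.93), ("b.wav", 0.12), ("c.wav", 0.55), ("d.wav", 0.08)]
to_review = lowest_confidence(scored, 2)
print(to_review)
```

The clips returned this way are candidates for a manual check, not automatically wrong.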

(Silenter) #11

I used LibriVox/Tatoeba/vox combined.
My tool automatically cleans all the sentences (translates numbers, replaces symbols, abbreviations, days, months, currencies, etc.).
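
To illustrate the kind of cleaning meant here, a minimal sketch for German with only the standard library. The rules, the digit table, and the function `clean_sentence` are hypothetical examples, not the tool's actual rule set, and only single digits are spelled out; sentences with longer numbers are dropped instead of guessed at.

```python
import re

# Hypothetical replacement rules for German; the real rules are not shown here.
REPLACEMENTS = [
    (r"\bz\.\s?B\.", "zum Beispiel"),
    (r"%", " Prozent"),
    (r"€", " Euro"),
    (r"&", " und "),
]
DIGITS = {
    "0": "null", "1": "eins", "2": "zwei", "3": "drei", "4": "vier",
    "5": "fünf", "6": "sechs", "7": "sieben", "8": "acht", "9": "neun",
}


def clean_sentence(text):
    """Apply the rules, spell out single digits, and keep only a-z, umlauts, ß and spaces.

    Returns None when a multi-digit number remains, so the sentence can be skipped.
    """
    for pattern, replacement in REPLACEMENTS:
        text = re.sub(pattern, replacement, text)
    # Spell out standalone single digits only.
    text = re.sub(r"\b[0-9]\b", lambda m: DIGITS[m.group()], text)
    if re.search(r"[0-9]", text):
        return None  # unhandled number: safer to drop than to guess
    text = text.lower()
    text = re.sub(r"[^a-zäöüß ]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


cleaned = clean_sentence("Das kostet 5€, z.B. heute!")
print(cleaned)
```

A full version would need a proper number-to-words step per language, which is where native speakers can help most.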

Better play around with my tool… :smiley:

I’m pretty sure some native speakers can correct the mistakes I made and get even higher scores!

(Lissyx) #12

How much did you reinvent the wheel there? I have a lot of processing like that tailored for French:

It’d benefit everyone if you also sent a PR to