Multilingual Dataset Combiner/Cleaner


(Silenter) #1

First of all…

Cheers and thanks for this amazing project. I've waited nearly four years for something like this.
I can't explain how grateful I am…

So after I reached 12% in German, I was determined to reach similar results with other languages as well.

Long story short…
I built some Python scripts to collect, sort, and clean datasets for DeepSpeech.
You can prepare training data with just one command.

I tried to make it as user-friendly and convenient as possible.

Any suggestions or questions are welcome. If you have any idea for future features, let me know.

And please share your results/arguments.

If you know of more datasets for the languages below, please share them with me.
Maybe you can write to your government if they are holding back data, as in the Netherlands. They are only damaging themselves…
I will integrate those as well.

Datasets so far:
Common Voice
VoxForge
LibriVox
Spoken Wiki (aligner is broken; will be fixed in the next version)
Tatoeba
Tuda
Zamia
Vystadial
African Accented French
Nicolas French

I won't put the download links for the Common Voice dataset in the database because of its terms of use.
However, I will create an option to insert the links yourself after you have accepted the terms and received the links.

Tests so far:

de = 9.84%
https://drive.google.com/open?id=1quyJ9cHX4f5wEg3K3QayEmgqlhoYUPUd

pl = 13.7%
https://drive.google.com/open?id=14oDu1Kes2I16ReBhCJpAFETHVRETlT0N

es = 13.9%
https://drive.google.com/open?id=1Yw5SUbIzKUqsEQCwP-eoaTW492QYc1Ol

it = 18.4%
https://drive.google.com/open?id=14l-jx56zM84EWpfhkYT0gHc9cZZ-Ti9D

fr = 22.7%
https://drive.google.com/open?id=1tHNM-7HnPQBdooVgTxNl6F-pgVbRMk3h

uk = 29.9%
https://drive.google.com/open?id=1dQ5MzlkhjdiQpLCJDNqV1-Z2GsquXIx1

ru = 36.9%
https://drive.google.com/open?id=1eBm2aD0QGh8y5LgZP0MYqZresdcVIvgz

nl = 39.6%
https://drive.google.com/open?id=1eP8ug3qTUwodI3uEaofJ5xqjhMfjWUs6

pt = 50.7%
https://drive.google.com/open?id=1QE7PIUnQXS6X_t90O8bTiupJu5a0kPf-

cs = not enough data
lt = not enough data
da = not enough data
et = not enough data
fi = not enough data
ro = not enough data
sq = not enough data
bg = not enough data
hr = not enough data
el = not enough data
ca = not enough data


(Yv) #2

Nice,
did you train from scratch or fine tuned existing models?


(Silenter) #3

Everything is done from scratch.
I really tried to make it as user-friendly as possible.
You only need two commands, plus a third to start the training via the generated training script.

It's a downloader, text crawler/sorter, and audio analyzer/converter combined.
Everything is saved in an SQL database, and you can then create datasets with specific rules and arguments.

For example, a French dataset with only male adults and a duration between 1 and 15 seconds.
There's also an option to insert replacement rules.
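That sort of filtered export can be sketched with plain SQLite. The table and column names below (`clips`, `lang`, `gender`, `age`, `duration`) are hypothetical; the tool's actual schema isn't shown here:

```python
import sqlite3

# Hypothetical schema; the real tool's database layout may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE clips (path TEXT, lang TEXT, gender TEXT, age TEXT, duration REAL)"
)
conn.executemany(
    "INSERT INTO clips VALUES (?, ?, ?, ?, ?)",
    [
        ("a.wav", "fr", "male", "adult", 4.2),
        ("b.wav", "fr", "female", "adult", 8.0),
        ("c.wav", "fr", "male", "adult", 22.5),  # too long
        ("d.wav", "de", "male", "adult", 3.1),   # wrong language
    ],
)

# "A French dataset with only male adults and a duration between 1 and 15 sec"
rows = conn.execute(
    "SELECT path FROM clips "
    "WHERE lang = 'fr' AND gender = 'male' AND age = 'adult' "
    "AND duration BETWEEN 1 AND 15"
).fetchall()
print([r[0] for r in rows])
```

Only `a.wav` satisfies all three rules in this toy example.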

All the results above used default values. I'm pretty sure someone will find better arguments.


(Yv) #4

Great, I'm looking forward to the release.

For the datasets, sizes in hours and perhaps stats about the length distribution (min, max, median) might be useful as a reference for others.
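For reference, those stats can be computed from the WAVs with the standard library alone. A minimal sketch, assuming a flat directory of PCM `.wav` files (the function name and layout are assumptions, not part of the tool):

```python
import wave
import statistics
from pathlib import Path

def duration_stats(wav_dir):
    """Collect per-clip durations (in seconds) and summarize them."""
    durations = []
    for path in Path(wav_dir).glob("*.wav"):
        with wave.open(str(path), "rb") as w:
            # frames / sample rate = clip length in seconds
            durations.append(w.getnframes() / w.getframerate())
    return {
        "clips": len(durations),
        "hours": sum(durations) / 3600,
        "min": min(durations),
        "max": max(durations),
        "median": statistics.median(durations),
    }
```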

Have you tried any forced alignment in the analyzer/converter?


(Silenter) #5

Yeah, in some early tests with German, but not in any of the results above. Those tests failed, but it had nothing to do with the alignment.

You do have the option to convert/align all WAVs in your database. I just haven't tried it out yet :stuck_out_tongue:


(Yv) #6

OK, thanks for the info. I'll check the code once it's published.


(Lissyx) #7

@nicolaspanel contributed some data for French. I've already run some tests with the Common Voice fr data and it gave quite nice results: https://gitlab.com/nicolaspanel/TrainingSpeech


(Silenter) #8

Nice.
Added the Nicolas corpora.
Just started training with LibriVox/VoxForge/Nicolas combined.


(Silenter) #9

Okay, I released it anyway. :wink:
Have fun!


(Carlos Fonseca) #10

Hi, for the Spanish model, did you use VoxForge? If yes, did you clean it? I used Windows Speech Recognition to score the confidence of each sentence and then sorted them; I found that the top 80 were totally wrong.
Do you think you can share the validation set? It would be great to play around with it.


(Silenter) #11

I used LibriVox/Tatoeba/VoxForge combined.
My tool automatically cleans all the sentences (translates numbers, replaces symbols/abbreviations/days/months/currencies, etc.).
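As a toy illustration of that kind of cleaning (the replacement tables here are tiny, hypothetical English ones; the real tool's rule set is much larger and language-specific):

```python
import re

# Hypothetical, deliberately minimal replacement tables.
REPLACEMENTS = {
    "€": " euro ",
    "$": " dollar ",
    "%": " percent ",
    "&": " and ",
}
ABBREVIATIONS = {
    "mr.": "mister",
    "dr.": "doctor",
    "jan.": "january",
}
NUMBERS = {"0": "zero", "1": "one", "2": "two", "3": "three"}  # demo only

def clean_sentence(text):
    text = text.lower()
    for sym, word in REPLACEMENTS.items():
        text = text.replace(sym, word)
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # spell out single digits (a real cleaner handles multi-digit numbers)
    text = re.sub(r"\d", lambda m: " " + NUMBERS.get(m.group(), m.group()) + " ", text)
    return " ".join(text.split())

print(clean_sentence("Dr. Smith paid 3 $ on Jan. 1"))
```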

Better to play around with my tool… :smiley:

I'm pretty sure some native speakers can correct the mistakes I made and get even higher scores!


(Lissyx) #12

How much did you reinvent the wheel there? I have a lot of processing like that, tailored for French: https://github.com/Common-Voice/commonvoice-fr

It'd benefit everyone if you also sent a PR to https://github.com/mozilla/CorporaCreator


(Carlos Fonseca) #13

@silenter Thanks for sharing the Spanish model; it's helping me a lot to review transcriptions in combination with Windows speech recognition. Just one thing: I've noticed that the output shows á as Ã¡ with the Windows client. Which values did you use for the LM?

I think this is the same issue that @roseman mentioned to me: .NET is using a default encoding that causes wrong outputs. It would be great to see your solution, @roseman
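An accented character coming out as two garbage characters is classic mojibake: UTF-8 bytes decoded on the reading side with a single-byte codec such as Latin-1/Windows-1252, which matches the default-encoding issue described above. A quick demonstration in Python:

```python
# "á" is two bytes in UTF-8 (0xC3 0xA1); decoding those bytes with a
# single-byte codec turns each byte into its own character.
utf8_bytes = "á".encode("utf-8")
print(utf8_bytes)                        # b'\xc3\xa1'

mangled = utf8_bytes.decode("latin-1")   # wrong codec on the reading side
print(mangled)                           # 'Ã¡'

restored = utf8_bytes.decode("utf-8")    # correct codec restores the text
print(restored)                          # 'á'
```

The fix is always to decode with the same codec the bytes were written in (here, UTF-8), not the platform default.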


(Roseman) #14

I just opened a pull request with the fix I found working for me here.


(Carlos Fonseca) #15

Thanks for the PR, I can confirm that the changes fixed the issue :slight_smile:


(Roseman) #16

So happy it was of help. I was literally going crazy over that; my own language is non-Latin and I was only getting Japanese-style characters!