Training on a Persian dataset does not converge

Mohammad_Hossein_Elahimanesh · July 15, 2020, 10:00am

Hi
My train command:
python3 DeepSpeech.py --train_files data/clips/train-all.csv --dev_files data/clips/dev.csv --test_files data/clips/test.csv

According to the training log, the validation loss stops decreasing after 8 epochs.

Also, when I train on a small dataset and let all epochs finish, the output of the network doesn’t change when I change the input audio. My model always returns the same string.

lissyx · July 15, 2020, 10:02am

How much data do you have ?
You might need to adjust parameters for the training …

Mohammad_Hossein_Elahimanesh · July 15, 2020, 10:07am

2385 files in train.csv
34998 files in train-all.csv
2130 files in dev.csv
2244 files in test.csv
113456 files in validation.csv

lissyx · July 15, 2020, 10:12am

How much data, not how many files.

Mohammad_Hossein_Elahimanesh · July 15, 2020, 10:16am

The Persian dataset contains about 230,000 files and about 270 hours of speech.

lissyx · July 15, 2020, 10:19am

270h advertised on Common Voice ? How much once imported in DeepSpeech training ? Have you had a look at my message about parameters ?
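One way to answer the "how much once imported" question is to sum the audio duration from the training CSV. The sketch below is only an estimate: it assumes the CSV has the usual `wav_filename,wav_filesize,transcript` header and that every clip is 16 kHz, 16-bit, mono PCM with a 44-byte WAV header. The function name `estimate_hours` is made up for this example.

```python
import csv


def estimate_hours(csv_path, sample_rate=16000, bytes_per_sample=2,
                   channels=1, header_bytes=44):
    """Estimate total audio duration in hours from the wav_filesize column.

    Assumption: uncompressed PCM WAV files with a fixed-size header.
    """
    bytes_per_second = sample_rate * bytes_per_sample * channels
    total_seconds = 0.0
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Subtract the header so only audio payload is counted.
            payload = max(int(row["wav_filesize"]) - header_bytes, 0)
            total_seconds += payload / bytes_per_second
    return total_seconds / 3600.0
```

Calling `estimate_hours("data/clips/train-all.csv")` would then give a rough figure to compare against the hours advertised for the full dataset.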

Mohammad_Hossein_Elahimanesh · July 15, 2020, 10:28am

Yes, 270h advertised on Common Voice.

I haven’t changed the default parameters.
There are about 40 hours of training data when using the train-all.csv file.

othiele · July 15, 2020, 10:39am

Definitely change the default params for dropout and batch size. Dropout to 0.3 or higher, try higher values. Batch size for faster runs, try 4,8,…
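As a rough illustration, the original command could be extended with these settings. This is a sketch, not a recommended configuration: the helper `build_train_command` is hypothetical, the values 0.3 and 8 are only starting points taken from the suggestion above, and the flag names (`--dropout_rate`, `--train_batch_size`, `--dev_batch_size`, `--test_batch_size`) should be checked against the flags.py of your DeepSpeech checkout.

```python
import shlex


def build_train_command(train_csv, dev_csv, test_csv,
                        dropout_rate=0.3, batch_size=8):
    """Assemble a DeepSpeech.py training command as an argument list."""
    return [
        "python3", "DeepSpeech.py",
        "--train_files", train_csv,
        "--dev_files", dev_csv,
        "--test_files", test_csv,
        # Values below are starting points to tune, not known-good settings.
        "--dropout_rate", str(dropout_rate),
        "--train_batch_size", str(batch_size),
        "--dev_batch_size", str(batch_size),
        "--test_batch_size", str(batch_size),
    ]


cmd = build_train_command(
    "data/clips/train-all.csv",
    "data/clips/dev.csv",
    "data/clips/test.csv",
)
# Print a shell-ready command line.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

Keeping the command as a list makes it easy to loop over several dropout or batch-size values and compare validation loss between runs.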

lissyx · July 15, 2020, 1:06pm

We can’t tell you the parameters that will work, it depends on your data.

So yeah, 40h, it’s not surprising you are not getting interesting results.

Mohammad_Hossein_Elahimanesh · August 3, 2020, 11:29am

After editing some of the parameters, my model converged.
(
learning_rate: 0.0003
batch: 16
n_hidden: 1024
)

But WER and CER are very bad on the test data.
No improvement can be seen in the test results.

Some of my test output:

WER: 1.000000, CER: 1.000000, loss: 213.787552

wav: file://data/clips/common_voice_fa_20115457.wav
src: “من به راگبی علاقه ندارم”
res: “”

Median WER:

WER: 1.000000, CER: 0.666667, loss: 2.507114

wav: file://data/clips/common_voice_fa_19211506.wav
src: “می خوام برا آخرین بار”
res: “یامکخرنا”

WER: 1.000000, CER: 0.821429, loss: 2.494120

wav: file://data/clips/common_voice_fa_19293030.wav
src: “او صورتحساب را پرداخت می کند”
res: “ستسبرک”

WER: 1.000000, CER: 0.774194, loss: 2.488260

wav: file://data/clips/common_voice_fa_19209341.wav
src: “او پایکس پیک را پشت سر می گذارد”
res: “پاییرگسر”

lissyx · August 3, 2020, 12:16pm

Are you still working on 40h of data? If so and you are training from scratch, it’s not unexpected.

You might want to re-generate the Common Voice data to allow more duplicates, as can be done in https://github.com/Common-Voice/commonvoice-fr/blob/master/DeepSpeech/Dockerfile.train
https://github.com/Common-Voice/commonvoice-fr/blob/master/DeepSpeech/CONTRIBUTING.md#build-the-image

Especially the duplicate_sentence_count parameter. With 270 hours and those parameters, it should start to be better.
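To make the idea concrete, here is a hypothetical sketch of what a per-sentence duplicate limit does to a list of clips. It is not the Common Voice tooling itself, and `cap_duplicates` is a name invented for this example. It only shows why a low limit shrinks the usable data when many speakers read the same sentence.

```python
from collections import defaultdict


def cap_duplicates(rows, max_per_sentence):
    """Keep at most max_per_sentence clips for each distinct transcript.

    rows: iterable of dicts with a "transcript" key (assumed layout).
    """
    seen = defaultdict(int)
    kept = []
    for row in rows:
        sentence = row["transcript"]
        if seen[sentence] < max_per_sentence:
            seen[sentence] += 1
            kept.append(row)
    return kept
```

With a limit of 1, three recordings of the same sentence collapse to a single clip. Raising the limit keeps more of them, at the cost of more repeated text in the training set.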

Can you share complete test output?

At least the plot of loss for training / validation looks quite good

What have you used for external scorer ?

Mohammad_Hossein_Elahimanesh · August 10, 2020, 8:24am

No. At this point, we use as much data as possible for training, about 270 h.

flags.py doesn’t have a duplicate_sentence_count flag.

I don’t use a scorer during the training phase, but data/lm/kenlm.scorer is used by default in flags.py.
Is it necessary to train a new scorer and use it as the external scorer?

lissyx · August 10, 2020, 8:51am

In the Dockerfile.train linked earlier.

Default scorer is made for english data + alphabet, it’s not going to work with other data …

othiele · August 10, 2020, 9:21am

It looks like you are already using a custom scorer as you get some Persian output. A good scorer will significantly improve your WER.

Mohammad_Hossein_Elahimanesh · September 8, 2020, 5:30am

Hi, Olaf, and all of my friends on this page.
After training a Persian scorer, everything works very well.
Thank you for all the notes.

Test epoch | Steps: 169 | Elapsed Time: 0:01:13
Test on data/clips/test.csv - WER: 0.037417, CER: 0.021683, loss: 8.717824

sample output:

Best WER:

WER: 0.000000, CER: 0.000000, loss: 46.499149

wav: file://data/clips/common_voice_fa_18566240.wav
src: “و اینجاست که سرمایهگذاران آگاه میتوانند از سرمایهگذاران کنشگر تقلید کنند”
res: “و اینجاست که سرمایهگذاران آگاه میتوانند از سرمایهگذاران کنشگر تقلید کنند”

WER: 0.000000, CER: 0.000000, loss: 46.358776

wav: file://data/clips/common_voice_fa_18600525.wav
src: “نمیدونم افسرده شدم یا بقایای جنین تو دلم مونده که اینجوری میشم”
res: “نمیدونم افسرده شدم یا بقایای جنین تو دلم مونده که اینجوری میشم”

WER: 0.000000, CER: 0.000000, loss: 40.054466

wav: file://data/clips/common_voice_fa_18615441.wav
src: “که به تعضیف و خلق ناپایداریهای ژئوپلیتک منجر میشود”
res: “که به تعضیف و خلق ناپایداریهای ژئوپلیتک منجر میشود”

WER: 0.000000, CER: 0.000000, loss: 38.130241

wav: file://data/clips/common_voice_fa_18627888.wav
src: “یک برنامه مرتبط با سلامتی اطلاعات شما را به شرکت بیمه درمان میفروشد”
res: “یک برنامه مرتبط با سلامتی اطلاعات شما را به شرکت بیمه درمان میفروشد”

WER: 0.000000, CER: 0.000000, loss: 37.861706

wav: file://data/clips/common_voice_fa_18568948.wav
src: “ترافیک شدیدیم بود و دیدن نور ماشینا و چراغا و لامپهای مراکز تجاری حس خوبی بهم میدادن”
res: “ترافیک شدیدیم بود و دیدن نور ماشینا و چراغا و لامپهای مراکز تجاری حس خوبی بهم میدادن”

Median WER:

WER: 0.000000, CER: 0.000000, loss: 3.321916

wav: file://data/clips/common_voice_fa_20945891.wav
src: “هلال احمر”
res: “هلال احمر”

WER: 0.000000, CER: 0.000000, loss: 3.310491

wav: file://data/clips/common_voice_fa_19244502.wav
src: “خیلی خوب عزیزم زود باش”
res: “خیلی خوب عزیزم زود باش”

WER: 0.000000, CER: 0.000000, loss: 3.299302

wav: file://data/clips/common_voice_fa_21583196.wav
src: “او و خواهرش به صورت اتفاقی گروههای خونی یکسانی داشتند”
res: “او و خواهرش به صورت اتفاقی گروههای خونی یکسانی داشتند”

WER: 0.000000, CER: 0.000000, loss: 3.288944

wav: file://data/clips/common_voice_fa_21576305.wav
src: “اگه نخوردیم نون گندم دیدیم دست مردم”
res: “اگه نخوردیم نون گندم دیدیم دست مردم”

WER: 0.000000, CER: 0.000000, loss: 3.286442

wav: file://data/clips/common_voice_fa_21576584.wav
src: “کيسهاى حاوی دوده”
res: “کيسهاى حاوی دوده”

Worst WER:

WER: 1.000000, CER: 0.800000, loss: 15.691487

wav: file://data/clips/common_voice_fa_21000540.wav
src: “صدا و سیما”
res: “سونا”

WER: 1.000000, CER: 0.800000, loss: 13.577458

wav: file://data/clips/common_voice_fa_19333145.wav
src: “من گرسنهام”
res: “گرم و”

WER: 1.000000, CER: 0.636364, loss: 12.150107

wav: file://data/clips/common_voice_fa_20765379.wav
src: “کباب می کند”
res: “ما یک”

WER: 1.333333, CER: 0.142857, loss: 4.585015

wav: file://data/clips/common_voice_fa_19214371.wav
src: “معذرت میخوامموفق باشی”
res: “معظرت می خوام موفق باشی”

WER: 1.666667, CER: 0.238095, loss: 16.671534

wav: file://data/clips/common_voice_fa_18226107.wav
src: “معذرت ميخوامموفق باشي”
res: “معظرت می خوام موفق باشی”
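For readers wondering how WER can be above 1 in the last two samples: if WER is the word-level edit distance divided by the number of reference words, a result with more words than the reference can need more edits than the reference has words. The WER of 1.333333 is consistent with four word edits over a three-word reference. Below is a minimal sketch of that calculation, assuming a plain Levenshtein distance; it is not copied from the DeepSpeech evaluation code, and the Latin strings are placeholders.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Placeholder strings: two reference words are wrong and two extra words
# appear in the result, so 4 edits / 3 reference words.
print(wer("one twothreefour five", "uno two three four five"))
```

The same pattern explains why a missing space in the source transcript is punished heavily at word level while the CER stays low.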

othiele · September 8, 2020, 7:04am

The Median WERs are all 0, this looks like you might be overfitting your data.

Have you tried some fresh audio files on your trained model? Or what is the WER of the test set (which is not in train or dev)?

Mohammad_Hossein_Elahimanesh · September 8, 2020, 7:40am

Test epoch | Steps: 169 | Elapsed Time: 0:01:13
Test on data/clips/test.csv - WER: 0.037417, CER: 0.021683, loss: 8.717824

But my test data was included in the training files, so the test was not fair.
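A quick check for this kind of leakage is to compare the clip names in the two CSV files. The sketch below assumes both files have a `wav_filename` column; `load_clip_names` and `find_overlap` are names made up for this example.

```python
import csv


def load_clip_names(csv_path):
    """Return the set of wav_filename values in a DeepSpeech-style CSV."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return {row["wav_filename"] for row in csv.DictReader(f)}


def find_overlap(train_csv, test_csv):
    """Return clip names that appear in both the training and test CSVs."""
    return load_clip_names(train_csv) & load_clip_names(test_csv)
```

If `find_overlap("data/clips/train-all.csv", "data/clips/test.csv")` is non-empty, the reported test WER does not measure generalization. The same comparison on the `transcript` column would show whether test sentences were seen during training.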

othiele · September 8, 2020, 8:03am

Just try your model with your own voice or some other new material and see how it is doing.

masoud_parpanchi · September 26, 2020, 12:23pm

Did you just update alphabet.txt, or did you change other parts of the code for Persian training?

And about your language model, did you clean the text before creating it (num2word, removing punctuation, etc.)? I am doing this preprocessing myself and would like to know your experience.

othiele · September 26, 2020, 12:34pm

Yes, you don’t use UTF-8 mode, but just change the alphabet to the Persian one and train. You will need your own scorer for Persian, the English one does not work.
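As a sketch of the alphabet step, the characters can be collected from the transcripts of the training CSVs and written one per line. This assumes the CSVs have a `transcript` column and that alphabet.txt uses one character per line, with lines starting with `#` treated as comments and a literal `#` written as `\#`. The function names are hypothetical, so check the result against the header of the alphabet.txt shipped with your version.

```python
import csv


def collect_alphabet(csv_paths):
    """Collect the sorted set of characters used in all transcripts."""
    chars = set()
    for path in csv_paths:
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                chars.update(row["transcript"])
    return sorted(chars)


def write_alphabet(chars, out_path):
    """Write one character per line; escape '#' so it is not a comment."""
    with open(out_path, "w", encoding="utf-8") as f:
        for ch in chars:
            f.write(("\\#" if ch == "#" else ch) + "\n")
```

It is worth reading the collected list before training: stray Latin letters, digits, or punctuation usually point to transcripts that still need cleaning.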

Preprocessing should be the same as for other NLP tasks in Persian. Maybe look around or, as you did here, ask others.
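A minimal cleaning sketch, under stated assumptions, might look like the following. It maps the Arabic letters ي and ك to their Persian forms ی and ک (the worst-WER samples earlier in the thread appear to differ in exactly this way), replaces Unicode punctuation with spaces, and collapses whitespace. It does not convert numbers to words, and the mapping table is illustrative rather than complete. `normalize_persian` is a hypothetical name.

```python
import re
import unicodedata

# Arabic yeh (U+064A) -> Persian yeh (U+06CC), Arabic kaf (U+0643) -> keheh (U+06A9).
ARABIC_TO_PERSIAN = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})


def normalize_persian(text):
    """Apply the same light cleaning to transcripts and language-model text."""
    text = text.translate(ARABIC_TO_PERSIAN)
    # Replace every character in a Unicode punctuation category (P*) by a space.
    text = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()
```

Whatever rules are chosen, applying the identical function to the training transcripts, the alphabet, and the scorer corpus keeps the three consistent with each other.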