Training on Persian dataset cannot converge

We can’t tell you the parameters that will work, it depends on your data.

So yeah, 40h, it’s not surprising you are not getting interesting results.


After editing some parameters, my model converged.
(
learning_rate: 0.0003
batch: 16
n_hidden: 1024
)
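For context, those settings would map onto DeepSpeech training flags roughly as below. This is a sketch, not the poster's actual command; the file paths, epoch count, and checkpoint directory are placeholders.

```shell
# Hypothetical DeepSpeech training invocation using the parameters above.
python3 DeepSpeech.py \
  --train_files data/clips/train.csv \
  --dev_files data/clips/dev.csv \
  --test_files data/clips/test.csv \
  --alphabet_config_path data/alphabet.txt \
  --learning_rate 0.0003 \
  --train_batch_size 16 \
  --n_hidden 1024 \
  --epochs 30 \
  --checkpoint_dir checkpoints/
```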

But WER and CER are very bad on the test data; no improvement can be seen in the test results.

Some of my test output:

WER: 1.000000, CER: 1.000000, loss: 213.787552

  • wav: file://data/clips/common_voice_fa_20115457.wav
  • src: “من به راگبی علاقه ندارم”
  • res: “”

Median WER:

WER: 1.000000, CER: 0.666667, loss: 2.507114

  • wav: file://data/clips/common_voice_fa_19211506.wav
  • src: “می خوام برا آخرین بار”
  • res: “یامکخرنا”

WER: 1.000000, CER: 0.821429, loss: 2.494120

  • wav: file://data/clips/common_voice_fa_19293030.wav
  • src: “او صورتحساب را پرداخت می کند”
  • res: “ستسبرک”

WER: 1.000000, CER: 0.774194, loss: 2.488260

  • wav: file://data/clips/common_voice_fa_19209341.wav
  • src: “او پایکس پیک را پشت سر می گذارد”
  • res: “پاییرگسر”

Are you still working on 40h of data? If so and you are training from scratch, it’s not unexpected.

You might want to re-generate the Common Voice data to allow more duplicates, as is done in https://github.com/Common-Voice/commonvoice-fr/blob/master/DeepSpeech/Dockerfile.train
https://github.com/Common-Voice/commonvoice-fr/blob/master/DeepSpeech/CONTRIBUTING.md#build-the-image

Especially the duplicate_sentence_count parameter. With 270 hours and those parameters, it should start to get better.

Can you share complete test output?

At least the plot of the loss for training/validation looks quite good.

What have you used as the external scorer?


No. At this time we use as much data as possible for training, about 270 h.

flags.py doesn’t have a duplicate_sentence_count flag.

I don’t use a scorer during the training phase, but data/lm/kenlm.scorer is used as the default in flags.py.
Is it necessary to use an external scorer and train a new one?

In the Dockerfile.train linked earlier.

The default scorer is made for English data and alphabet; it’s not going to work with other data …

It looks like you are already using a custom scorer as you get some Persian output. A good scorer will significantly improve your WER.
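For reference, a custom scorer for a new language is typically built with the KenLM tooling shipped in the DeepSpeech repo. The sketch below is hedged: the corpus file name, output paths, KenLM binary location, and pruning/quantization values are placeholders, and the alpha/beta defaults would normally be tuned with lm_optimizer.py.

```shell
# Step 1 (sketch): build a pruned KenLM language model from cleaned Persian text.
python3 data/lm/generate_lm.py \
  --input_txt corpus.txt \
  --output_dir persian_lm/ \
  --top_k 500000 \
  --kenlm_bins /path/to/kenlm/build/bin \
  --arpa_order 5 \
  --max_arpa_memory "85%" \
  --arpa_prune "0|0|1" \
  --binary_a_bits 255 \
  --binary_q_bits 8 \
  --binary_type trie

# Step 2 (sketch): package lm.binary + vocabulary into a .scorer file.
./generate_scorer_package \
  --alphabet data/alphabet.txt \
  --lm persian_lm/lm.binary \
  --vocab persian_lm/vocab-500000.txt \
  --package kenlm.scorer \
  --default_alpha 0.93 \
  --default_beta 1.18
```

The alphabet passed to generate_scorer_package must be the same one used for training, otherwise decoding and the scorer disagree on labels.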


Hi Olaf, and all of my friends on this page.
After training a Persian scorer, everything works very well.
Thank you for all the notes.

Test epoch | Steps: 169 | Elapsed Time: 0:01:13
Test on data/clips/test.csv - WER: 0.037417, CER: 0.021683, loss: 8.717824

sample output:

Best WER:

WER: 0.000000, CER: 0.000000, loss: 46.499149

  • wav: file://data/clips/common_voice_fa_18566240.wav
  • src: “و اینجاست که سرمایهگذاران آگاه میتوانند از سرمایهگذاران کنشگر تقلید کنند”
  • res: “و اینجاست که سرمایهگذاران آگاه میتوانند از سرمایهگذاران کنشگر تقلید کنند”

WER: 0.000000, CER: 0.000000, loss: 46.358776

  • wav: file://data/clips/common_voice_fa_18600525.wav
  • src: “نمیدونم افسرده شدم یا بقایای جنین تو دلم مونده که اینجوری میشم”
  • res: “نمیدونم افسرده شدم یا بقایای جنین تو دلم مونده که اینجوری میشم”

WER: 0.000000, CER: 0.000000, loss: 40.054466

  • wav: file://data/clips/common_voice_fa_18615441.wav
  • src: “که به تعضیف و خلق ناپایداریهای ژئوپلیتک منجر میشود”
  • res: “که به تعضیف و خلق ناپایداریهای ژئوپلیتک منجر میشود”

WER: 0.000000, CER: 0.000000, loss: 38.130241

  • wav: file://data/clips/common_voice_fa_18627888.wav
  • src: “یک برنامه مرتبط با سلامتی اطلاعات شما را به شرکت بیمه درمان میفروشد”
  • res: “یک برنامه مرتبط با سلامتی اطلاعات شما را به شرکت بیمه درمان میفروشد”

WER: 0.000000, CER: 0.000000, loss: 37.861706

  • wav: file://data/clips/common_voice_fa_18568948.wav
  • src: “ترافیک شدیدیم بود و دیدن نور ماشینا و چراغا و لامپهای مراکز تجاری حس خوبی بهم میدادن”
  • res: “ترافیک شدیدیم بود و دیدن نور ماشینا و چراغا و لامپهای مراکز تجاری حس خوبی بهم میدادن”

Median WER:

WER: 0.000000, CER: 0.000000, loss: 3.321916

  • wav: file://data/clips/common_voice_fa_20945891.wav
  • src: “هلال احمر”
  • res: “هلال احمر”

WER: 0.000000, CER: 0.000000, loss: 3.310491

  • wav: file://data/clips/common_voice_fa_19244502.wav
  • src: “خیلی خوب عزیزم زود باش”
  • res: “خیلی خوب عزیزم زود باش”

WER: 0.000000, CER: 0.000000, loss: 3.299302

  • wav: file://data/clips/common_voice_fa_21583196.wav
  • src: “او و خواهرش به صورت اتفاقی گروههای خونی یکسانی داشتند”
  • res: “او و خواهرش به صورت اتفاقی گروههای خونی یکسانی داشتند”

WER: 0.000000, CER: 0.000000, loss: 3.288944

  • wav: file://data/clips/common_voice_fa_21576305.wav
  • src: “اگه نخوردیم نون گندم دیدیم دست مردم”
  • res: “اگه نخوردیم نون گندم دیدیم دست مردم”

WER: 0.000000, CER: 0.000000, loss: 3.286442

  • wav: file://data/clips/common_voice_fa_21576584.wav
  • src: “کيسهاى حاوی دوده”
  • res: “کيسهاى حاوی دوده”

Worst WER:

WER: 1.000000, CER: 0.800000, loss: 15.691487

  • wav: file://data/clips/common_voice_fa_21000540.wav
  • src: “صدا و سیما”
  • res: “سونا”

WER: 1.000000, CER: 0.800000, loss: 13.577458

  • wav: file://data/clips/common_voice_fa_19333145.wav
  • src: “من گرسنهام”
  • res: “گرم و”

WER: 1.000000, CER: 0.636364, loss: 12.150107

  • wav: file://data/clips/common_voice_fa_20765379.wav
  • src: “کباب می کند”
  • res: “ما یک”

WER: 1.333333, CER: 0.142857, loss: 4.585015

  • wav: file://data/clips/common_voice_fa_19214371.wav
  • src: “معذرت میخوامموفق باشی”
  • res: “معظرت می خوام موفق باشی”

WER: 1.666667, CER: 0.238095, loss: 16.671534

  • wav: file://data/clips/common_voice_fa_18226107.wav
  • src: “معذرت ميخوامموفق باشي”
  • res: “معظرت می خوام موفق باشی”

The median WERs are all 0; it looks like you might be overfitting your data.

Have you tried some fresh audio files on your trained model? Or what is the WER of the test set (which is not in train or dev)?

Test epoch | Steps: 169 | Elapsed Time: 0:01:13
Test on data/clips/test.csv - WER: 0.037417, CER: 0.021683, loss: 8.717824

But my test data was in the training files, so the test was not fair.

Just try your model with your own voice or some other new material and see how it is doing.

Did you just update alphabet.txt, or did you change other parts of the code for Persian training?

And about your language model: did you clean the text before building it (num2word, removing punctuation, etc.)? I am doing this preprocessing, and I want to know about your experience.
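As a sketch, the cleaning steps mentioned above (letter normalization and punctuation removal) might look like the following. The normalization map and punctuation set are illustrative, not a complete Persian normalizer, and the num2word step is omitted; it would need a Persian number-to-words library.

```python
import re

# Arabic letter variants commonly found in Persian text -> Persian forms.
ARABIC_TO_PERSIAN = {"ي": "ی", "ك": "ک"}

# Illustrative punctuation set, including Persian marks ؟ ، ؛ « ».
PUNCT = re.compile(r"[.,!?؟،؛:;\"'()«»\[\]-]")

def clean_line(line: str) -> str:
    """Normalize letter variants, strip punctuation, collapse whitespace."""
    for src, dst in ARABIC_TO_PERSIAN.items():
        line = line.replace(src, dst)
    line = PUNCT.sub(" ", line)
    return " ".join(line.split())
```

Running every corpus line through such a function before generate_lm.py keeps the language-model vocabulary consistent with the training alphabet.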


Yes. You don’t use UTF-8 mode; just change the alphabet to the Persian one and train. You will need your own scorer for Persian; the English one does not work.

Preprocessing should be the same as for other NLP tasks in Persian; maybe look around or, as you did here, ask others :slight_smile:


Did you use transfer learning from Latin-alphabet languages to reach this WER on the test data?

I think that may help, because many people have used transfer learning from English to Spanish or German and got better results. But I am not sure this can help for Persian too :thinking:
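For what it’s worth, DeepSpeech does ship transfer-learning flags that reinitialize the final layer(s), which is what makes reusing a checkpoint with a different alphabet possible at all. A hedged sketch; the checkpoint directories and hyperparameters below are placeholders, and n_hidden must match the source checkpoint:

```shell
# Hypothetical transfer-learning run from a released English checkpoint.
# --drop_source_layers 1 re-initializes the output layer so the Persian
# alphabet (different size) can replace the English one.
python3 DeepSpeech.py \
  --train_files data/clips/train.csv \
  --dev_files data/clips/dev.csv \
  --alphabet_config_path data/alphabet_fa.txt \
  --load_checkpoint_dir deepspeech-0.9.3-checkpoint/ \
  --save_checkpoint_dir persian_checkpoints/ \
  --drop_source_layers 1 \
  --n_hidden 2048 \
  --learning_rate 0.0001
```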

Hi Mohammad,
Would you please share the details of your training, like the values of your parameters and your loss logs?
I’m doing the same as you did and can’t get appropriate results.
Thanks in advance.

And what is the size of the Persian text file used for training the scorer, and what kind of sentences do you use?

I think transfer learning is appropriate when the alphabets are the same (I mean the Latin alphabet), but for Persian, because the alphabet is different, you can’t do transfer learning.


Hey, can you share your steps for making your own scorer file? I’m unable to get the scorer when I run the command below:

```shell
!./generate_scorer_package --alphabet /gdrive/My\ Drive/dataset/UrduAlphabet_newscrawl.txt --lm /gdrive/My\ Drive/urdu_lm/lm.binary --package /content/gdrive/My\ Drive/dataset/kenlm.scorer --default_alpha 0.931289039105002 --default_beta 1.1834137581510284 --vocab /content/gdrive/My\ Drive/urdu_lm/vocab-500
```

I get the error below when I run it:

```
500000 unique words read from vocabulary file.
Doesn’t look like a character based (Bytes Are All You Need) model.
--force_bytes_output_mode was not specified, using value infered from vocabulary contents: false
Invalid label 0
```

Really? Reuben gave you the answer 2 days ago. Please don’t hijack older threads, but read what we post:

Dear Mohammad,
Can you share the changes you made to the original DeepSpeech repo to make it work with the Arabic/Farsi script?

If there are fixes for training Arabic/Farsi, please send a PR …