Training on Persian dataset cannot converge

We can’t tell you the parameters that will work, it depends on your data.

So yeah, 40h, it’s not surprising you are not getting interesting results.


After editing some parameters, my model converged.
(
learning_rate: 0.0003
batch: 16
n_hidden: 1024
)
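For context, those settings would map onto DeepSpeech training flags roughly as below. This is a sketch, not the poster's actual command; the file paths, epoch count, and checkpoint directory are placeholders.

```shell
# Hypothetical DeepSpeech training invocation using the parameters above.
python3 DeepSpeech.py \
  --train_files data/clips/train.csv \
  --dev_files data/clips/dev.csv \
  --test_files data/clips/test.csv \
  --alphabet_config_path data/alphabet.txt \
  --learning_rate 0.0003 \
  --train_batch_size 16 \
  --n_hidden 1024 \
  --epochs 30 \
  --checkpoint_dir checkpoints/
```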

But WER and CER are very bad on the test data; no improvement can be seen in the test results.

Some of my test output:

WER: 1.000000, CER: 1.000000, loss: 213.787552

  • wav: file://data/clips/common_voice_fa_20115457.wav
  • src: “من به راگبی علاقه ندارم”
  • res: “”

Median WER:

WER: 1.000000, CER: 0.666667, loss: 2.507114

  • wav: file://data/clips/common_voice_fa_19211506.wav
  • src: “می خوام برا آخرین بار”
  • res: “یامکخرنا”

WER: 1.000000, CER: 0.821429, loss: 2.494120

  • wav: file://data/clips/common_voice_fa_19293030.wav
  • src: “او صورتحساب را پرداخت می کند”
  • res: “ستسبرک”

WER: 1.000000, CER: 0.774194, loss: 2.488260

  • wav: file://data/clips/common_voice_fa_19209341.wav
  • src: “او پایکس پیک را پشت سر می گذارد”
  • res: “پاییرگسر”

Are you still working on 40h of data? If so and you are training from scratch, it’s not unexpected.

You might want to re-generate the Common Voice data to allow more duplicates, as is done in https://github.com/Common-Voice/commonvoice-fr/blob/master/DeepSpeech/Dockerfile.train
https://github.com/Common-Voice/commonvoice-fr/blob/master/DeepSpeech/CONTRIBUTING.md#build-the-image

Especially the duplicate_sentence_count parameter. With 270 hours and those parameters, it should start to get better.

Can you share complete test output?

At least the plot of the loss for training/validation looks quite good.

What have you used as the external scorer?


No. At this time we use as much data as possible for training, about 270 h.

flags.py doesn’t have a duplicate_sentence_count flag.

I don’t use a scorer during the training phase, but data/lm/kenlm.scorer is used as the default in flags.py.
Is it necessary to use an external scorer and train a new one?

In the Dockerfile.train linked earlier.

The default scorer is made for English data and alphabet; it’s not going to work with other data …

It looks like you are already using a custom scorer as you get some Persian output. A good scorer will significantly improve your WER.
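For reference, a custom scorer for a new language is typically built with the KenLM tooling shipped in the DeepSpeech repo. The sketch below is hedged: the corpus file name, output paths, KenLM binary location, and pruning/quantization values are placeholders, and the alpha/beta defaults would normally be tuned with lm_optimizer.py.

```shell
# Step 1 (sketch): build a pruned KenLM language model from cleaned Persian text.
python3 data/lm/generate_lm.py \
  --input_txt corpus.txt \
  --output_dir persian_lm/ \
  --top_k 500000 \
  --kenlm_bins /path/to/kenlm/build/bin \
  --arpa_order 5 \
  --max_arpa_memory "85%" \
  --arpa_prune "0|0|1" \
  --binary_a_bits 255 \
  --binary_q_bits 8 \
  --binary_type trie

# Step 2 (sketch): package lm.binary + vocabulary into a .scorer file.
./generate_scorer_package \
  --alphabet data/alphabet.txt \
  --lm persian_lm/lm.binary \
  --vocab persian_lm/vocab-500000.txt \
  --package kenlm.scorer \
  --default_alpha 0.93 \
  --default_beta 1.18
```

The alphabet passed to generate_scorer_package must be the same one used for training, otherwise decoding and the scorer disagree on labels.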


Hi Olaf, and all of my friends on this page.
After training a Persian scorer, everything works very well.
Thank you for all the notes.

Test epoch | Steps: 169 | Elapsed Time: 0:01:13
Test on data/clips/test.csv - WER: 0.037417, CER: 0.021683, loss: 8.717824

sample output:

Best WER:

WER: 0.000000, CER: 0.000000, loss: 46.499149

  • wav: file://data/clips/common_voice_fa_18566240.wav
  • src: “و اینجاست که سرمایهگذاران آگاه میتوانند از سرمایهگذاران کنشگر تقلید کنند”
  • res: “و اینجاست که سرمایهگذاران آگاه میتوانند از سرمایهگذاران کنشگر تقلید کنند”

WER: 0.000000, CER: 0.000000, loss: 46.358776

  • wav: file://data/clips/common_voice_fa_18600525.wav
  • src: “نمیدونم افسرده شدم یا بقایای جنین تو دلم مونده که اینجوری میشم”
  • res: “نمیدونم افسرده شدم یا بقایای جنین تو دلم مونده که اینجوری میشم”

WER: 0.000000, CER: 0.000000, loss: 40.054466

  • wav: file://data/clips/common_voice_fa_18615441.wav
  • src: “که به تعضیف و خلق ناپایداریهای ژئوپلیتک منجر میشود”
  • res: “که به تعضیف و خلق ناپایداریهای ژئوپلیتک منجر میشود”

WER: 0.000000, CER: 0.000000, loss: 38.130241

  • wav: file://data/clips/common_voice_fa_18627888.wav
  • src: “یک برنامه مرتبط با سلامتی اطلاعات شما را به شرکت بیمه درمان میفروشد”
  • res: “یک برنامه مرتبط با سلامتی اطلاعات شما را به شرکت بیمه درمان میفروشد”

WER: 0.000000, CER: 0.000000, loss: 37.861706

  • wav: file://data/clips/common_voice_fa_18568948.wav
  • src: “ترافیک شدیدیم بود و دیدن نور ماشینا و چراغا و لامپهای مراکز تجاری حس خوبی بهم میدادن”
  • res: “ترافیک شدیدیم بود و دیدن نور ماشینا و چراغا و لامپهای مراکز تجاری حس خوبی بهم میدادن”

Median WER:

WER: 0.000000, CER: 0.000000, loss: 3.321916

  • wav: file://data/clips/common_voice_fa_20945891.wav
  • src: “هلال احمر”
  • res: “هلال احمر”

WER: 0.000000, CER: 0.000000, loss: 3.310491

  • wav: file://data/clips/common_voice_fa_19244502.wav
  • src: “خیلی خوب عزیزم زود باش”
  • res: “خیلی خوب عزیزم زود باش”

WER: 0.000000, CER: 0.000000, loss: 3.299302

  • wav: file://data/clips/common_voice_fa_21583196.wav
  • src: “او و خواهرش به صورت اتفاقی گروههای خونی یکسانی داشتند”
  • res: “او و خواهرش به صورت اتفاقی گروههای خونی یکسانی داشتند”

WER: 0.000000, CER: 0.000000, loss: 3.288944

  • wav: file://data/clips/common_voice_fa_21576305.wav
  • src: “اگه نخوردیم نون گندم دیدیم دست مردم”
  • res: “اگه نخوردیم نون گندم دیدیم دست مردم”

WER: 0.000000, CER: 0.000000, loss: 3.286442

  • wav: file://data/clips/common_voice_fa_21576584.wav
  • src: “کيسهاى حاوی دوده”
  • res: “کيسهاى حاوی دوده”

Worst WER:

WER: 1.000000, CER: 0.800000, loss: 15.691487

  • wav: file://data/clips/common_voice_fa_21000540.wav
  • src: “صدا و سیما”
  • res: “سونا”

WER: 1.000000, CER: 0.800000, loss: 13.577458

  • wav: file://data/clips/common_voice_fa_19333145.wav
  • src: “من گرسنهام”
  • res: “گرم و”

WER: 1.000000, CER: 0.636364, loss: 12.150107

  • wav: file://data/clips/common_voice_fa_20765379.wav
  • src: “کباب می کند”
  • res: “ما یک”

WER: 1.333333, CER: 0.142857, loss: 4.585015

  • wav: file://data/clips/common_voice_fa_19214371.wav
  • src: “معذرت میخوامموفق باشی”
  • res: “معظرت می خوام موفق باشی”

WER: 1.666667, CER: 0.238095, loss: 16.671534

  • wav: file://data/clips/common_voice_fa_18226107.wav
  • src: “معذرت ميخوامموفق باشي”
  • res: “معظرت می خوام موفق باشی”

The median WERs are all 0; it looks like you might be overfitting your data.

Have you tried some fresh audio files on your trained model? Or what is the WER of the test set (which is not in train or dev)?

Test epoch | Steps: 169 | Elapsed Time: 0:01:13
Test on data/clips/test.csv - WER: 0.037417, CER: 0.021683, loss: 8.717824

But my test data was in the training files, so the test was not fair.

Just try your model with your own voice or some other new material and see how it is doing.

Did you just update alphabet.txt, or did you change other parts of the code for Persian training?

And about your language model: did you clean the text before building it (num2word, removing punctuation, etc.)? I am doing this preprocessing, and I want to know about your experience.
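As a sketch, the cleaning steps mentioned above (letter normalization and punctuation removal) might look like the following. The normalization map and punctuation set are illustrative, not a complete Persian normalizer, and the num2word step is omitted; it would need a Persian number-to-words library.

```python
import re

# Arabic letter variants commonly found in Persian text -> Persian forms.
ARABIC_TO_PERSIAN = {"ي": "ی", "ك": "ک"}

# Illustrative punctuation set, including Persian marks ؟ ، ؛ « ».
PUNCT = re.compile(r"[.,!?؟،؛:;\"'()«»\[\]-]")

def clean_line(line: str) -> str:
    """Normalize letter variants, strip punctuation, collapse whitespace."""
    for src, dst in ARABIC_TO_PERSIAN.items():
        line = line.replace(src, dst)
    line = PUNCT.sub(" ", line)
    return " ".join(line.split())
```

Running every corpus line through such a function before generate_lm.py keeps the language-model vocabulary consistent with the training alphabet.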


Yes. You don’t use UTF-8 mode; just change the alphabet to the Persian one and train. You will need your own scorer for Persian; the English one does not work.

Preprocessing should be the same as for other NLP tasks in Persian; maybe look around or, as you did here, ask others :slight_smile:


Did you use transfer learning from Latin-alphabet languages to reach this WER on the test data?

I think that may help, because many people have used transfer learning from English to Spanish or German and got better results. But I am not sure this can help for Persian too :thinking:
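For what it’s worth, DeepSpeech does ship transfer-learning flags that reinitialize the final layer(s), which is what makes reusing a checkpoint with a different alphabet possible at all. A hedged sketch; the checkpoint directories and hyperparameters below are placeholders, and n_hidden must match the source checkpoint:

```shell
# Hypothetical transfer-learning run from a released English checkpoint.
# --drop_source_layers 1 re-initializes the output layer so the Persian
# alphabet (different size) can replace the English one.
python3 DeepSpeech.py \
  --train_files data/clips/train.csv \
  --dev_files data/clips/dev.csv \
  --alphabet_config_path data/alphabet_fa.txt \
  --load_checkpoint_dir deepspeech-0.9.3-checkpoint/ \
  --save_checkpoint_dir persian_checkpoints/ \
  --drop_source_layers 1 \
  --n_hidden 2048 \
  --learning_rate 0.0001
```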

Hi Mohammad,
Would you please share the details of your training, like the values of your parameters and your loss logs?
I’m doing the same as you did and can’t get appropriate results.
Thanks in advance.

And what is the size of the Persian text file used for training the scorer, and what kind of sentences do you use?

I think transfer learning is appropriate when the alphabets are the same (I mean the Latin alphabet), but for Persian, because the alphabet is different, you can’t do transfer learning.


Hey, can you share your steps for making your own scorer file? I’m unable to get the scorer when I run the command below:

```shell
!./generate_scorer_package --alphabet /gdrive/My\ Drive/dataset/UrduAlphabet_newscrawl.txt --lm /gdrive/My\ Drive/urdu_lm/lm.binary --package /content/gdrive/My\ Drive/dataset/kenlm.scorer --default_alpha 0.931289039105002 --default_beta 1.1834137581510284 --vocab /content/gdrive/My\ Drive/urdu_lm/vocab-500
```

I get the error below when I run it:

```
500000 unique words read from vocabulary file.
Doesn’t look like a character based (Bytes Are All You Need) model.
--force_bytes_output_mode was not specified, using value infered from vocabulary contents: false
Invalid label 0
```

Really? Reuben gave you the answer 2 days ago. Please don’t hijack older threads, but read what we post:

Dear Mohammad,
Can you share the changes you made to the original DeepSpeech repo to make it work with the Arabic/Farsi script?

If there are fixes for training Arabic/Farsi, please send a PR …