Training on Persian dataset cannot converge

Hi
My train command:
python3 DeepSpeech.py --train_files data/clips/train-all.csv --dev_files data/clips/dev.csv --test_files data/clips/test.csv

The training log is below:

(((
I Could not find most recent checkpoint.
I Initializing all variables.
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 3:08:55 | Steps: 34997 | Loss: 78.528839
Epoch 0 | Validation | Elapsed Time: 0:04:16 | Steps: 2129 | Loss: 100.409679 | Dataset: data/clips/dev.csv
I Saved new best validating model with loss 100.409679 to: /root/.local/share/deepspeech/checkpoints/best_dev-34997
Epoch 1 | Training | Elapsed Time: 3:08:54 | Steps: 34997 | Loss: 77.350966
Epoch 1 | Validation | Elapsed Time: 0:04:15 | Steps: 2129 | Loss: 99.251050 | Dataset: data/clips/dev.csv
I Saved new best validating model with loss 99.251050 to: /root/.local/share/deepspeech/checkpoints/best_dev-69994
Epoch 2 | Training | Elapsed Time: 3:08:52 | Steps: 34997 | Loss: 77.263415
Epoch 2 | Validation | Elapsed Time: 0:04:16 | Steps: 2129 | Loss: 99.234356 | Dataset: data/clips/dev.csv
I Saved new best validating model with loss 99.234356 to: /root/.local/share/deepspeech/checkpoints/best_dev-104991
Epoch 3 | Training | Elapsed Time: 3:08:53 | Steps: 34997 | Loss: 77.190274
Epoch 3 | Validation | Elapsed Time: 0:04:15 | Steps: 2129 | Loss: 99.166183 | Dataset: data/clips/dev.csv
I Saved new best validating model with loss 99.166183 to: /root/.local/share/deepspeech/checkpoints/best_dev-139988
Epoch 4 | Training | Elapsed Time: 3:08:53 | Steps: 34997 | Loss: 77.247450
Epoch 4 | Validation | Elapsed Time: 0:04:16 | Steps: 2129 | Loss: 99.692980 | Dataset: data/clips/dev.csv
Epoch 5 | Training | Elapsed Time: 3:08:55 | Steps: 34997 | Loss: 77.273132
Epoch 5 | Validation | Elapsed Time: 0:04:15 | Steps: 2129 | Loss: 99.232199 | Dataset: data/clips/dev.csv
Epoch 6 | Training | Elapsed Time: 3:08:51 | Steps: 34997 | Loss: 77.262940
Epoch 6 | Validation | Elapsed Time: 0:04:15 | Steps: 2129 | Loss: 99.947124 | Dataset: data/clips/dev.csv
Epoch 7 | Training | Elapsed Time: 3:08:53 | Steps: 34997 | Loss: 77.264972
Epoch 7 | Validation | Elapsed Time: 0:04:16 | Steps: 2129 | Loss: 99.395676 | Dataset: data/clips/dev.csv
)))

After 8 epochs, the validation loss has not decreased.
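To put a number on the plateau, the log lines can be parsed and the best validation loss compared with the first one. This is only a sketch: the regular expression assumes the exact "Epoch N | Training/Validation | ... | Loss: X" layout shown in the log above, and the helper names are made up for illustration. On the log above, the validation loss improves by roughly 1% over 8 epochs.

```python
import re

# Matches lines such as:
# "Epoch 0 | Validation | Elapsed Time: 0:04:16 | Steps: 2129 | Loss: 100.409679 | Dataset: ..."
EPOCH_LINE = re.compile(r"Epoch (\d+) \| (Training|Validation) \|.*?Loss: ([0-9.]+)")


def parse_losses(log_text):
    """Return {"Training": {epoch: loss}, "Validation": {epoch: loss}}."""
    losses = {"Training": {}, "Validation": {}}
    for match in EPOCH_LINE.finditer(log_text):
        epoch = int(match.group(1))
        phase = match.group(2)
        losses[phase][epoch] = float(match.group(3))
    return losses


def relative_improvement(per_epoch):
    """Fraction by which the best loss improves on the first epoch's loss."""
    ordered = [per_epoch[epoch] for epoch in sorted(per_epoch)]
    return (ordered[0] - min(ordered)) / ordered[0]


# Two epochs copied from the log above, as a small in-memory example.
sample_log = """\
Epoch 0 | Training | Elapsed Time: 3:08:55 | Steps: 34997 | Loss: 78.528839
Epoch 0 | Validation | Elapsed Time: 0:04:16 | Steps: 2129 | Loss: 100.409679 | Dataset: data/clips/dev.csv
I Saved new best validating model with loss 100.409679 to: /root/.local/share/deepspeech/checkpoints/best_dev-34997
Epoch 3 | Training | Elapsed Time: 3:08:53 | Steps: 34997 | Loss: 77.190274
Epoch 3 | Validation | Elapsed Time: 0:04:15 | Steps: 2129 | Loss: 99.166183 | Dataset: data/clips/dev.csv
"""

losses = parse_losses(sample_log)
print("validation improvement: {:.2%}".format(relative_improvement(losses["Validation"])))
```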

Also, when I train on a small dataset and let all epochs finish, the network's output does not change when the input audio changes. My model always returns the same string.

How much data do you have ?
You might need to adjust parameters for the training …

2385 files in train.csv
34998 files in train-all.csv
2130 files in dev.csv
2244 files in test.csv
113456 files in validation.csv

How much data, not how many files.

The Persian dataset contains about 230,000 files and about 270 hours of speech.

Is that the 270h advertised on Common Voice? How much is left once it is imported for DeepSpeech training? Have you had a look at my message about parameters?
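One way to answer the "how much once imported" question is to estimate the hours of audio directly from the training CSV. The sketch below assumes the CSV has a wav_filesize column (as in the CSVs written by the DeepSpeech importers) and that the clips are 16 kHz, 16-bit, mono PCM WAV files with a 44-byte header. Both are assumptions to check against your own data, and estimate_hours is a made-up helper name.

```python
import csv
import io

BYTES_PER_SECOND = 16000 * 2  # assumption: 16 kHz sample rate, 16-bit mono samples
WAV_HEADER_BYTES = 44         # assumption: plain PCM WAV header


def estimate_hours(csv_file):
    """Estimate total audio duration in hours from the wav_filesize column."""
    total_seconds = 0.0
    for row in csv.DictReader(csv_file):
        payload = max(int(row["wav_filesize"]) - WAV_HEADER_BYTES, 0)
        total_seconds += payload / BYTES_PER_SECOND
    return total_seconds / 3600


# In-memory example with two hypothetical one-hour files.
sample_csv = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "a.wav,115200044,first\n"
    "b.wav,115200044,second\n"
)
print(estimate_hours(sample_csv))
```

For a real run, open the file instead, for example with open("data/clips/train-all.csv", newline="", encoding="utf-8") as f, and pass f to estimate_hours.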

Yes, 270h advertised on Common Voice.

I haven't changed the default parameters.
With the train-all.csv file, there are about 40 hours of training data.

Definitely change the default parameters for dropout and batch size. Set dropout to 0.3 or higher, and try higher values. Increase the batch size for faster runs; try 4, 8, …
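As a rough sketch, the original training command can be extended with these settings. The flag names dropout_rate and train_batch_size are what I believe flags.py uses, but treat them as assumptions and check the flags.py in your checkout. The values 0.3 and 8 are just the starting points suggested above, and build_train_command is a made-up helper.

```python
import shlex


def build_train_command(train_csv, dev_csv, test_csv, **flags):
    """Build the DeepSpeech.py command line as a list of arguments."""
    cmd = [
        "python3", "DeepSpeech.py",
        "--train_files", train_csv,
        "--dev_files", dev_csv,
        "--test_files", test_csv,
    ]
    # Every extra keyword becomes "--name value".
    for name, value in flags.items():
        cmd += ["--" + name, str(value)]
    return cmd


cmd = build_train_command(
    "data/clips/train-all.csv",
    "data/clips/dev.csv",
    "data/clips/test.csv",
    dropout_rate=0.3,      # assumed flag name; starting value suggested above
    train_batch_size=8,    # assumed flag name; one of the suggested batch sizes
)
print(shlex.join(cmd))  # copy-pasteable command line (Python 3.8+)
```

Keeping the command as a list also makes it easy to pass to subprocess.run(cmd) when trying several dropout values in a loop.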

We can’t tell you the parameters that will work, it depends on your data.

So yeah, 40h, it’s not surprising you are not getting interesting results.

After changing some of the parameters, my model converged.
(
learning_rate: 0.0003
batch: 16
n_hidden: 1024
)

But WER and CER are still very bad on the test data.
I cannot see any improvement in the test results.

Some of my test output:

WER: 1.000000, CER: 1.000000, loss: 213.787552

  • wav: file://data/clips/common_voice_fa_20115457.wav
  • src: “من به راگبی علاقه ندارم”
  • res: “”

Median WER:

WER: 1.000000, CER: 0.666667, loss: 2.507114

  • wav: file://data/clips/common_voice_fa_19211506.wav
  • src: “می خوام برا آخرین بار”
  • res: “یامکخرنا”

WER: 1.000000, CER: 0.821429, loss: 2.494120

  • wav: file://data/clips/common_voice_fa_19293030.wav
  • src: “او صورتحساب را پرداخت می کند”
  • res: “ستسبرک”

WER: 1.000000, CER: 0.774194, loss: 2.488260

  • wav: file://data/clips/common_voice_fa_19209341.wav
  • src: “او پایکس پیک را پشت سر می گذارد”
  • res: “پاییرگسر”

Are you still working on 40h of data? If so and you are training from scratch, it’s not unexpected.

You might want to regenerate the Common Voice data to allow more duplicate sentences, as is done in https://github.com/Common-Voice/commonvoice-fr/blob/master/DeepSpeech/Dockerfile.train
https://github.com/Common-Voice/commonvoice-fr/blob/master/DeepSpeech/CONTRIBUTING.md#build-the-image

Look especially at the duplicate_sentence_count parameter. With 270 hours and those parameters, results should start to improve.

Can you share complete test output?

At least the plot of loss for training / validation looks quite good

What have you used as the external scorer?

No. We now use as much data as possible for training, about 270 hours.

flags.py doesn't have a duplicate_sentence_count flag.

I don't use a scorer during the training phase, but data/lm/kenlm.scorer is the default in flags.py.
Is it necessary to use an external scorer and train a new one?

The duplicate_sentence_count parameter is in the Dockerfile.train linked earlier, not in flags.py.

The default scorer is built for English data and the English alphabet; it's not going to work with other data.

It looks like you are already using a custom scorer as you get some Persian output. A good scorer will significantly improve your WER.
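Building a scorer starts from a plain text corpus with one sentence per line. As a minimal sketch of that first step only, the snippet below collects unique transcripts from a training CSV and can skip sentences that also occur in the test set, so the language model does not simply memorise the test sentences. It assumes the CSV has a transcript column, and transcripts_for_lm is a made-up helper. Turning the corpus into a .scorer file is done with the tooling described in the DeepSpeech documentation and is not shown here.

```python
import csv
import io


def transcripts_for_lm(csv_file, exclude=frozenset()):
    """Return unique, non-empty transcripts in first-seen order, skipping `exclude`."""
    seen = set()
    lines = []
    for row in csv.DictReader(csv_file):
        text = row["transcript"].strip()
        if text and text not in seen and text not in exclude:
            seen.add(text)
            lines.append(text)
    return lines


# In-memory example; real use would open the train and test CSV files.
train_csv = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "a.wav,1000,hello world\n"
    "b.wav,1000,hello world\n"
    "c.wav,1000,good morning\n"
    "d.wav,1000,held out sentence\n"
)
corpus = transcripts_for_lm(train_csv, exclude={"held out sentence"})
print("\n".join(corpus))
```

A larger, more varied text source than the training transcripts alone generally gives a more useful language model, if one is available for Persian.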

Hi, Olaf, and all of my friends on this page.
After training a Persian scorer, everything works very well.
Thank you for all the notes.

Test epoch | Steps: 169 | Elapsed Time: 0:01:13
Test on data/clips/test.csv - WER: 0.037417, CER: 0.021683, loss: 8.717824

sample output:

Best WER:

WER: 0.000000, CER: 0.000000, loss: 46.499149

  • wav: file://data/clips/common_voice_fa_18566240.wav
  • src: “و اینجاست که سرمایهگذاران آگاه میتوانند از سرمایهگذاران کنشگر تقلید کنند”
  • res: “و اینجاست که سرمایهگذاران آگاه میتوانند از سرمایهگذاران کنشگر تقلید کنند”

WER: 0.000000, CER: 0.000000, loss: 46.358776

  • wav: file://data/clips/common_voice_fa_18600525.wav
  • src: “نمیدونم افسرده شدم یا بقایای جنین تو دلم مونده که اینجوری میشم”
  • res: “نمیدونم افسرده شدم یا بقایای جنین تو دلم مونده که اینجوری میشم”

WER: 0.000000, CER: 0.000000, loss: 40.054466

  • wav: file://data/clips/common_voice_fa_18615441.wav
  • src: “که به تعضیف و خلق ناپایداریهای ژئوپلیتک منجر میشود”
  • res: “که به تعضیف و خلق ناپایداریهای ژئوپلیتک منجر میشود”

WER: 0.000000, CER: 0.000000, loss: 38.130241

  • wav: file://data/clips/common_voice_fa_18627888.wav
  • src: “یک برنامه مرتبط با سلامتی اطلاعات شما را به شرکت بیمه درمان میفروشد”
  • res: “یک برنامه مرتبط با سلامتی اطلاعات شما را به شرکت بیمه درمان میفروشد”

WER: 0.000000, CER: 0.000000, loss: 37.861706

  • wav: file://data/clips/common_voice_fa_18568948.wav
  • src: “ترافیک شدیدیم بود و دیدن نور ماشینا و چراغا و لامپهای مراکز تجاری حس خوبی بهم میدادن”
  • res: “ترافیک شدیدیم بود و دیدن نور ماشینا و چراغا و لامپهای مراکز تجاری حس خوبی بهم میدادن”

Median WER:

WER: 0.000000, CER: 0.000000, loss: 3.321916

  • wav: file://data/clips/common_voice_fa_20945891.wav
  • src: “هلال احمر”
  • res: “هلال احمر”

WER: 0.000000, CER: 0.000000, loss: 3.310491

  • wav: file://data/clips/common_voice_fa_19244502.wav
  • src: “خیلی خوب عزیزم زود باش”
  • res: “خیلی خوب عزیزم زود باش”

WER: 0.000000, CER: 0.000000, loss: 3.299302

  • wav: file://data/clips/common_voice_fa_21583196.wav
  • src: “او و خواهرش به صورت اتفاقی گروههای خونی یکسانی داشتند”
  • res: “او و خواهرش به صورت اتفاقی گروههای خونی یکسانی داشتند”

WER: 0.000000, CER: 0.000000, loss: 3.288944

  • wav: file://data/clips/common_voice_fa_21576305.wav
  • src: “اگه نخوردیم نون گندم دیدیم دست مردم”
  • res: “اگه نخوردیم نون گندم دیدیم دست مردم”

WER: 0.000000, CER: 0.000000, loss: 3.286442

  • wav: file://data/clips/common_voice_fa_21576584.wav
  • src: “کيسهاى حاوی دوده”
  • res: “کيسهاى حاوی دوده”

Worst WER:

WER: 1.000000, CER: 0.800000, loss: 15.691487

  • wav: file://data/clips/common_voice_fa_21000540.wav
  • src: “صدا و سیما”
  • res: “سونا”

WER: 1.000000, CER: 0.800000, loss: 13.577458

  • wav: file://data/clips/common_voice_fa_19333145.wav
  • src: “من گرسنهام”
  • res: “گرم و”

WER: 1.000000, CER: 0.636364, loss: 12.150107

  • wav: file://data/clips/common_voice_fa_20765379.wav
  • src: “کباب می کند”
  • res: “ما یک”

WER: 1.333333, CER: 0.142857, loss: 4.585015

  • wav: file://data/clips/common_voice_fa_19214371.wav
  • src: “معذرت میخوامموفق باشی”
  • res: “معظرت می خوام موفق باشی”

WER: 1.666667, CER: 0.238095, loss: 16.671534

  • wav: file://data/clips/common_voice_fa_18226107.wav
  • src: “معذرت ميخوامموفق باشي”
  • res: “معظرت می خوام موفق باشی”
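A note on the last two samples: WER can exceed 1.0 because it is the word-level edit distance divided by the number of words in the reference, and inserted words count as errors. In the first of the two, the 3-word reference needs 4 word edits (two substitutions and two insertions) to become the 5-word result, which gives 4/3 = 1.333333. The CER of 0.142857 is 3 character edits over 21 reference characters. The sketch below is a plain Levenshtein implementation to illustrate the arithmetic, not DeepSpeech's own evaluation code, and it assumes a non-empty reference.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# ASCII stand-in for the same pattern: one misspelled word, one run-together
# word that the model split into three.
print(wer("sorry iwantgoodluck be", "sorri i want goodluck be"))
```

This also shows why missing spaces in the reference transcripts (as in these two samples) are penalised heavily: a single run-together word in src costs one substitution plus one insertion for every extra word in res.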

The median WERs are all 0; this looks like you might be overfitting your data.

Have you tried some fresh audio files on your trained model? Or what is the WER of the test set (which is not in train or dev)?

Test epoch | Steps: 169 | Elapsed Time: 0:01:13
Test on data/clips/test.csv - WER: 0.037417, CER: 0.021683, loss: 8.717824

But my test data was included in the training files, so the test was not a fair one.
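A quick way to check for this kind of leakage is to compare the clip names in the two CSV files. The sketch below assumes both files have a wav_filename column, and leaked_clips is a made-up helper. It only catches identical file names, not the same sentence recorded in a different clip.

```python
import csv
import io


def leaked_clips(train_csv, test_csv):
    """Return the sorted test clip names that also appear in the training CSV."""
    train_files = {row["wav_filename"] for row in csv.DictReader(train_csv)}
    return sorted(
        row["wav_filename"]
        for row in csv.DictReader(test_csv)
        if row["wav_filename"] in train_files
    )


# In-memory example with hypothetical file names.
train_csv = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "clip_1.wav,1000,one\n"
    "clip_2.wav,1000,two\n"
)
test_csv = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "clip_2.wav,1000,two\n"
    "clip_3.wav,1000,three\n"
)
overlap = leaked_clips(train_csv, test_csv)
print(len(overlap), "test clips also appear in the training set")
```

If the list is not empty, remove those rows from the training CSV (or rebuild the splits) before trusting the reported test WER.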

Just try your model with your own voice or some other new material and see how it is doing.