Query regarding speed of training and issues with convergence

I am training a model on my own Marathi dataset, but I’ve noticed that training is fast in the initial steps of an epoch and keeps getting slower towards the end. Is this because of SortaGrad, or is it unrelated? I have 4 GPUs, 2 Titan Xp and 2 Titan V, and I’ve given the training process all of them.

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN V             Off  | 00000000:18:00.0 Off |                  N/A |
| 61%   83C    P2   146W / 250W |  11961MiB / 12066MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN V             Off  | 00000000:3B:00.0 Off |                  N/A |
| 66%   84C    P2   157W / 250W |  11961MiB / 12066MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            Off  | 00000000:86:00.0 Off |                  N/A |
| 68%   85C    P2   164W / 250W |  11843MiB / 12196MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN Xp            Off  | 00000000:AF:00.0  On |                  N/A |
| 61%   85C    P2   184W / 250W |  11868MiB / 12193MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

Also, my loss is stuck at 164. The dataset has 4–9 word sentences, almost 8k of them in training, about 8 hours of audio, and it’s generated using a TTS, so it’s very clean (no aberrations that would come from human error).

Config: most values are the defaults from flags.py. The ones I changed are listed below (an example invocation follows the list):
train_batch_size=20
dev_batch_size=2
test_batch_size=2
no_earlystop
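
For reference, these would typically be passed on the command line roughly as in the sketch below. The file paths are placeholders I made up, not from the original post, and the exact spelling of the early-stopping switch depends on the DeepSpeech version (flags.py defines a boolean early_stop, usually negated as --noearly_stop), so verify against your checkout:

# hypothetical invocation; paths are placeholders for the Marathi CSVs and alphabet
python3 DeepSpeech.py \
  --train_files /data/marathi/train.csv \
  --dev_files /data/marathi/dev.csv \
  --test_files /data/marathi/test.csv \
  --alphabet_config_path /data/marathi/alphabet.txt \
  --train_batch_size 20 \
  --dev_batch_size 2 \
  --test_batch_size 2 \
  --noearly_stop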

Can you provide a better status? “Faster” and “slower” are not really useful qualifications.

They don’t have the same characteristics, do they?

8 hours of data is not a lot; if you have not changed the learning rate and dropout, it’s not surprising.

I have not checked exactly how slow it gets, but it does slow down. I’ll check soon and post the details. I’m not doing it right now (also because it might be trivial), because I’m trying to fix the output, which always looks like this:
WER: 1.000000, CER: 1.000000, loss: 163.570694

  • src: “ंथैफ:गऻ:ळऩॊणळऀ:ंभॊढ:णॅण़धॊबफऻथ:लऻ:घैजऽ:थणऽ”
  • res: “”

They do not, but I have trained multiple other TensorFlow models on these cards. I’ll try training on just the Vs if that’s one of the things you are concerned about.

I just want to see if it overfits. I increased the LR to 0.001 and dropout to 0.2; n_hidden is at 1024 right now. Testing that out, it doesn’t look good at the moment (17 epochs in, loss 161 train, 145 test, on 4–9 word sentences).

WER: 1.000000, CER: 0.940299, loss: 0.004681

  • src: “विमान चालविणाऱ्या वैमानिकाला अनेक गोष्टींकडे सतत अवधान द्यावे लागते”
  • res: "मला "

Why is the loss this low (compared to the fact that it started out at 500) when the output is not even close?
What scale is it on? What do you usually expect?

It could slow down for a lot of reasons, and none of our training runs show any slowdown. We can’t really know for sure without more details on your system: maybe your setup is limited in memory and there’s some leakage (some people have complained about that, but we have not been able to reproduce it or get meaningful feedback).

What I’m concerned about is that the two GPU models don’t have the same speed, so maybe TensorFlow is just affected by that, and what you see as a slowdown is normal given your setup.

Please take inspiration from the LDC93S1 sample overfit.
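
For reference, the repository ships a script for exactly this sanity check; as far as I recall, it fetches the single LDC93S1 sample and overfits a tiny model on it, which is a quick way to verify that the training pipeline converges at all:

# run from the root of a DeepSpeech checkout
./bin/run-ldc93s1.sh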

I thought so too; I’ll have to look into that. Each card has about 12 GB of memory (HBM2 on the Titan V, GDDR5X on the Titan Xp), and the Titan Vs have considerably more CUDA cores (5120 vs 3840 on the Titan Xp). I suspect the leak more than the different architectures, if it’s not SortaGrad at work. I’ll post here if I find the answer.

Also, if you mean RAM, I have 256 GB, and I monitor the process; everything looks fine at that end.

I managed to get better results with my data. The problem was the language model I had built; I didn’t think the LM would deteriorate the results (which was a stupid assumption). I now have results that look more like the source, without using my trie or LM.

There is still something I forgot to check today. In the experiment that gave better results, all I did was skip the LM and trie options on the command line, but I know flags.py defaults to the English LM binary and trie that you have in your repo. I am not sure whether the default LM and trie were actually used or not (even if they were, since they can’t predict over an unknown charset, they should not affect the output, I assume again, hopefully not wrongly this time), but I was wondering how I would go about disabling the LM and trie altogether for initial experiments.

And about the loss being low: the loss is computed on the text predicted by the DeepSpeech network before post-processing by the LM and trie, and since the raw predicted text was better than the post-processed text, the loss was low.

256 GB? That should be more than enough, unless something badly leaks (some people complained about that; they upgraded some Python package and it got fixed).

If you set alpha and beta to 0.0, it should.
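
A sketch of what that looks like on the command line, with placeholder paths; lm_alpha and lm_beta are the flag names I recall from flags.py of this era, so verify against your checkout:

# hypothetical test run with the LM weights zeroed, so the beam-search
# decoder effectively ignores the language model and word-insertion bonus
python3 DeepSpeech.py \
  --test_files /data/marathi/test.csv \
  --alphabet_config_path /data/marathi/alphabet.txt \
  --test_batch_size 2 \
  --lm_alpha 0.0 \
  --lm_beta 0.0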

yes

I will try that.

That’s expected:

Generally I see that the Titan V is faster on this kind of task, so you should first try training with only the Vs. The weight sync can be a bottleneck limiting the true power of the Vs; I mean, the Vs waiting for the Xps to catch up before syncing is not really a good thing.
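
One low-effort way to try that, assuming the device indices from the nvidia-smi output above (0 and 1 are the Titan Vs) and placeholder paths:

# expose only the two Titan Vs so the slower Xps don't gate every weight sync
export CUDA_VISIBLE_DEVICES=0,1
python3 DeepSpeech.py \
  --train_files /data/marathi/train.csv \
  --dev_files /data/marathi/dev.csv \
  --alphabet_config_path /data/marathi/alphabet.txt \
  --train_batch_size 20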

Maybe try the NVIDIA-optimized container for automatic mixed-precision training (only with the two Titan Vs)?

Ah! Thank you for the clarification, so I wasn’t wrong on that part.

True. I’ll try this soon and post timings. Or maybe it’ll be better if I post my full findings once I manage to get my Marathi model running.

Right now:
WER: 1.000000, CER: 0.579710, loss: 0.000543

  • src: “अवशिष्टाचे एक चांगले उदाहरण स्फीनोडॉन या सरड्यासारख्या प्राण्याचे होय”
  • res: “अवउाताेउातगेउदहउसउतनोतेनउयाउसळउयाउसाउयाउपायओेउहोय”

(Note: the spaces are missing.)
There is very little resemblance between the result and the source, but it’s better than just one word. I am trying what lissyx suggested, setting lm_alpha and lm_beta to 0, and then seeing whether the LM was the problem.

The next experiment I have is to build a heavy-duty language model and then try this again.
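
For what it’s worth, the usual recipe for a DeepSpeech-style LM is KenLM; a rough sketch, where corpus.txt is a placeholder for a large Marathi text corpus (the trie is then built from the resulting lm.binary with the generate_trie tool from the native client, whose exact arguments depend on the DeepSpeech version):

# build a 5-gram ARPA LM from a plain-text corpus, then the binary KenLM format
lmplz --order 5 --text corpus.txt --arpa lm.arpa
build_binary lm.arpa lm.binary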

That looks like a wrong alphabet format, please share it. Some time ago my LM training was failing to read the alphabet properly and ended up removing spaces; I suggest trying the native client without the LM to see whether you get the same results.
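
For reference, a minimal native-client invocation without the LM looks something like this; the model and file names are placeholders, and on 0.5-era releases the LM is only used when --lm and --trie are passed:

# decode a single wav with the exported graph only, no LM or trie
deepspeech \
  --model output_graph.pb \
  --alphabet alphabet.txt \
  --audio test.wav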


I do believe that might be the issue. I fixed the empty-string res by updating the alphabet. Does space have to be the first character? Also, my charset has a zero-width space.

This is my alphabet set for now.

उ
प
ा
ळ
न
म
ो
इ
ग
अ
ॲ
ऍ
ह
ऴ
ऊ
त
थ
ज
े
य
भ
ओ
द
ॅ
स
व
ू
ञ
ऋ
।
ख
ि
ब
ध
ी
ु
फ
ऩ
ई    # the line below this one is the space character (this comment is not in the alphabet file)
 
ट
ै
ऱ
ः
॰
ऌ
ौ
ॉ
॥
क
ॠ
झ
्
ठ
श
ँ
ल
ऐ
ॻ
़
ऑ
ऽ
ड
औ
ङ
ण
ढ    # the line below this one is a zero-width space (this comment is not in the alphabet file)

ए
ष
च
छ
ं
ृ
आ
घ
र

I don’t know how your language works; what is the zero-width space used for?

It doesn’t need it; it was a remnant of the cleaning process. Do you think that might be the issue? If so, I’ll clean up the data again and try.
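
In case it helps, one way to find and strip the stray zero-width space (U+200B, bytes e2 80 8b in UTF-8) from the data before rebuilding the alphabet; the paths are placeholders and the sed line assumes GNU sed:

# list files that still contain a zero-width space
grep -rl $'\xe2\x80\x8b' /data/marathi/
# strip it in place from a transcript CSV
sed -i 's/\xe2\x80\x8b//g' /data/marathi/train.csv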

Is the alphabet format wonky, though? Or is it OK?

Yes, remove it. Please try posting the alphabet again using the forum’s preformatted-text format (</>).

It will be easier to read

Ok. On it.

I’ve updated the previous comment to have it as preformatted text.

Without the zero-width space, yes.

New alphabet.txt:

उ
प
ा
ळ
न
म
ो
इ
ग
अ
ॲ
ऍ
ह
ऴ
ऊ
त
थ
ज
े
य
भ
ओ
द
ॅ
स
व
ू
ञ
ऋ
।
ख
ि
ब
ध
ी
ु
फ
ऩ
ई
 
ट
ै
ऱ
ः
॰
ऌ
ौ
ॉ
॥
क
ॠ
झ
्
ठ
श
ँ
ल
ऐ
ॻ
़
ऑ
ऽ
ड
औ
ङ
ण
ढ
ए
ष
च
छ
ं
ृ
आ
घ
र

WER: 1.000000, CER: 0.500000, loss: 0.732161

  • src: “याला आपले अबोध हेतू कारण असतात”
  • res: “याउापेअउोहेउतूउाअसउतात”

WER: 1.000000, CER: 0.631579, loss: 0.948461

  • src: “अनोमा बौद्ध साहित्यात निर्देशिलेली भारतातील एक पवित्र नदी”
  • res: “अनउादउसाउतयातऊनउेेउातातउपवतन”

The output is still the same. Any insights on whether the alphabet should be ordered or not?

For the first training, no; but if you train a model and then change the alphabet order, then yes. Did you test it without the LM, and did you retrain the LM?

Looks like उ is being used as the space, maybe? I mean, swap the positions of the space and उ.
