Query regarding speed of training and issues with convergence

I figured it out. " " (space) has to be the first character in alphabet.txt.

src: "हा प्राणी सरीसृपांच्या सरपटणाऱ्या प्राण्यांच्या र्हिंकोसीफॅलिया गणातला आहे"
res: "हा पान सू स पान या स पना या प ा या या हे तो स तॅ या गा त ा हे"

The output is still bad, but the model just isn't trained enough yet. The important part: I got the space back.

You are right. That's exactly why I tried putting space first.

Testing without the LM for now.

I'll experiment and try to understand how best to standardize the alphabet formatting.
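
For reference while I standardize this: my working assumption is that alphabet.txt holds one character per line, lines starting with "#" are comments, and a character's label index is simply its line position. A toy sketch of writing a space-first file under those assumptions (file name and character set are just illustrative):

```python
# Illustrative only: write a toy, space-first alphabet.txt.
# Assumed format: one character per line; "#" lines are comments;
# a character's label index is its line position.
chars = [" "] + [chr(c) for c in range(ord("a"), ord("z") + 1)] + ["'"]
with open("alphabet.txt", "w", encoding="utf-8") as f:
    f.write("# Each line is one character; line order sets the label index.\n")
    f.write("\n".join(chars) + "\n")
```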

<This was a discussion we had in private because, as a new user, I had to wait 3 hours between posts. Putting it here so people have some context.>

After this conversation and a lot of fine-tuning, I have some results I am satisfied with. I will keep posting if I find anything interesting. Also, after I test the model out, if everything is as it should be, I'll post my protocol too. Thanks for helping me, @lissyx @carlfm01!!

Results:

--------------------------------------------------------------------------------
WER: 0.222222, CER: 0.033898, loss: 0.000426
 - src: "मध्यजीवमहाकल्पच्या अखेरपासून हे कुल लुप्त झाले असा समज होता"
 - res: "मध्यजीव महाकल्पाच्या अखेरपासून हे कुल लुप्त झाले असा समज होता"
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 0.000007
 - src: "याला आपले अबोध हेतू कारण असतात"
 - res: "याला आपले अबोध हेतू कारण असतात"
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 0.000015
 - src: "अवधानाची ही चंचलता जीवनोपयोगी असते"
 - res: "अवधानाची ही चंचलता जीवनोपयोगी असते"
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 0.000016
 - src: "यांना समुद्री अवशिष्ट म्हणणे योग्य होईल"
 - res: "यांना समुद्री अवशिष्ट म्हणणे योग्य होईल"
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 0.000021
 - src: "तसेच काही ठिकाणी थंड पाण्याची खोल सरोवरेही होती"
 - res: "तसेच काही ठिकाणी थंड पाण्याची खोल सरोवरेही होती"
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 0.000021
 - src: "व रेखावृत्त ते पू"
 - res: "व रेखावृत्त ते पू"
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 0.000037
 - src: "किमी लोकसंख्या आकारमानाने पोर्तुगालच्या सु"
 - res: "किमी लोकसंख्या आकारमानाने पोर्तुगालच्या सु"
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 0.000072
 - src: "यायोगे प्राण्याला परिसरातील सगळ्या गोष्टींशी संपर्क ठेवता येतो"
 - res: "यायोगे प्राण्याला परिसरातील सगळ्या गोष्टींशी संपर्क ठेवता येतो"
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 0.000085
 - src: "असा बौद्ध साहित्यात उल्लेख आहे"
 - res: "असा बौद्ध साहित्यात उल्लेख आहे"
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 0.000089
 - src: "विमान चालविणाऱ्या वैमानिकाला अनेक गोष्टींकडे सतत अवधान द्यावे लागते"
 - res: "विमान चालविणाऱ्या वैमानिकाला अनेक गोष्टींकडे सतत अवधान द्यावे लागते"
--------------------------------------------------------------------------------

I had some wonky issues with alphabet.txt; I will also clarify those after I get a concrete understanding of what was wrong.

Can you share some of that? It might be valuable information for others!

@lissyx I'll test the model thoroughly (tomorrow), and if everything goes as expected (read: the results are not a fluke, which I think is unlikely), then I'll post everything in another, clean post. Also, I have had some issues with the alphabet handling; I still have to work out why my initial models failed.

As soon as I am done with this, I'll post the whole config and my complete protocol. I do not want to spread misinformation or miss something important. The post will be up in a few days. Meanwhile, if anyone has queries about anything I did in this post, I'll respond to them here.

Thanks, that's perfectly understandable!

Sorry for the delay. I've been trying to understand why I can't train a decent model when space is not the first character in alphabet.txt. I've checked everything from text.py to the DS-ctcdecoder source, and I haven't found the cause. The only part left is _swigwrapper.cpython-36m-x86_64-linux-gnu.so (the .so file in the Python package after pip install). Everything else is working as expected. Any inputs?

Can you be more specific? Do you have anything supporting your theory?

@lissyx Ok, let me explain step by step.

The first few models I trained had issues where the spaces were missing (those results are discussed earlier in this very thread). @carlfm01 and I noticed that the first letter (also mentioned in this thread) might have been used for space, so I made space the first character, retrained the model, and then the model behaved as expected.

We discussed the possibility that I might have trained an initial model with one alphabet.txt and then retrained it with a modified alphabet.txt. I made sure to delete my previous models before training, and I also trained 3 models today: one with space first, a second with space in the middle, and finally one more with space at the beginning again, just to be sure. The space-at-the-beginning models work as expected, while the space-in-the-middle model reproduces the same "spaceless" output mentioned in this thread.

To debug, I checked the scripts where the alphabet is used:

  1. text.py: both the list and the dictionary used to map characters to labels (str_to_label) and back work as expected (a sketch of this mapping follows the list).

  2. The Alphabet in config.py also works as expected.

  3. The only part left is the ctc_beam_search_decoder_batch call in evaluate.py. This comes from the Python module installed via pip. I went to the source code and checked __init__.py, which calls functions from swigwrapper.py; that in turn calls into the .so file. This is the one part I haven't been able to check.

  4. I've also checked Alphabet.h in the native client, which shows that it does look for a specific space index; I haven't been able to verify whether that index is correct there either.
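
To make point 1 concrete, here is a stripped-down reconstruction of the kind of mapping text.py builds (my own sketch, not the actual DeepSpeech code). It shows why line order matters: a character's label is nothing more than its position among the non-comment lines of alphabet.txt.

```python
# Simplified sketch of the alphabet mapping (not the real text.py).
class ToyAlphabet:
    def __init__(self, path):
        with open(path, encoding="utf-8") as f:
            lines = [line.rstrip("\n") for line in f]
        # Label index == position among non-comment lines.
        self.label_to_str = [l for l in lines if l and not l.startswith("#")]
        self.str_to_label = {c: i for i, c in enumerate(self.label_to_str)}

    def encode(self, text):
        return [self.str_to_label[c] for c in text]

    def decode(self, labels):
        return "".join(self.label_to_str[i] for i in labels)
```

If " " sits on line 0, label 0 decodes to a space; move it to the middle of the file and label 0 decodes to whatever character now occupies that line.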

Well, Alphabet.h is what is used by the CTC decoder, so let's hope there's no issue there :slight_smile:

swigwrapper is unlikely to play a role here; it's just generated code wrapping the C++ for Python. It's not even written by us but by SWIG, generated from native_client/ctcdecode/swigwrapper.i.

@alchemi5t I guess it would be worth filing a GitHub issue now, with your STR (steps to reproduce). Do you know if this reproduces every time, everywhere? Have you tried with the LDC93S1 sample?

@lissyx
Here's the problem reproduced in English (LDC93S1).

First, training a model with the stock alphabet.txt (space is the first character).

#) 200 epochs 
Test on data/ldc93s1/ldc93s1.csv - WER: 1.000000, CER: 0.846154, loss: 119.647385
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.846154, loss: 119.647385
 - src: "she had your dark suit in greasy wash water all year"
 - res: "he oriana"
--------------------------------------------------------------------------------
#) 400 epochs 
Test on data/ldc93s1/ldc93s1.csv - WER: 0.909091, CER: 0.673077, loss: 71.410652
--------------------------------------------------------------------------------
WER: 0.909091, CER: 0.673077, loss: 71.410652
 - src: "she had your dark suit in greasy wash water all year"
 - res: "had you a swaller"

Second, only changing alphabet.txt to have space in the middle ('a' is the first character now).

#) 200 epochs

Test on data/ldc93s1/ldc93s1.csv - WER: 1.000000, CER: 0.673077, loss: 117.956841
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.673077, loss: 117.956841
 - src: "she had your dark suit in greasy wash water all year"
 - res: "hearararatisarasaearararar"
--------------------------------------------------------------------------------
#) 400 epochs


--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.538462, loss: 51.921734
 - src: "she had your dark suit in greasy wash water all year"
 - res: "shadyararauieaysashaealyear"
--------------------------------------------------------------------------------

Finally, to show that the first character was being used as space, I made 'x' the first character; space is somewhere in the middle, and 'a' is where 'x' used to be.

#) 200 epochs
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.826923, loss: 116.150482
 - src: "she had your dark suit in greasy wash water all year"
 - res: "haxraxianaxar"
--------------------------------------------------------------------------------

#) 400 epochs
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.711538, loss: 54.978935
 - src: "she had your dark suit in greasy wash water all year"
 - res: "haxraxuinxeahxtrxtlyr"
--------------------------------------------------------------------------------
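
A toy illustration of why 'x' lands exactly where the spaces should be (made-up alphabets, not the actual decoder): if some component still maps spaces to label 0, as with the old space-first file, while the new alphabet.txt has 'x' on line 0, every space decodes as 'x'.

```python
# Toy label/character mismatch demo (not DeepSpeech code).
old_alphabet = [" ", "a", "b", "c"]   # space-first file: space is label 0
new_alphabet = ["x", "a", "b", " "]   # 'x' now occupies line 0

text = "a b c"
labels = [old_alphabet.index(ch) for ch in text]  # labels under the old file
print("".join(new_alphabet[i] for i in labels))   # prints "axbxc"
```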

^^ Does my latest reply still warrant this? Or have I missed something?

Well, it seems it's a legit issue, so it's worth investigating. Can I ask you to triple-check that you are not re-using checkpoints in some way?

On it. Will train again.

@lissyx Fresh models again. Same output.

--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.711538, loss: 54.699074
 - src: "she had your dark suit in greasy wash water all year"
 - res: "haxraxuinxeahxtrxtlyr"
--------------------------------------------------------------------------------

This is the only issue left that I was not able to explain at all. If it seems like a legitimate issue, I'll try to spot the bug. Also, this means my training protocol might not be flawed, as it worked out wonderfully for another Indic dataset (albeit a tiny one). So I think I should be able to address everything in my post with instructions for training, except for the alphabet issue.

Yes, please file a bug, with as many details as possible and the STR. If you made any changes, please document them.

I’ll aggregate the data and file an issue tomorrow.

So @alchemi5t, I've investigated and I did reproduce your issue... to the point that I could track something down. When you change the alphabet.txt file, do you re-generate the trie file?

Ah, that's interesting. I did not regenerate the trie file. But for the initial models, I did not use the LM and the trie. I'll regenerate the trie and try again (and I'll also confirm without the LM and trie, for science). I'll post the results in 8-10 hrs.

Great catch though!

It's very likely your issue. I reproduced it exactly with your steps, and re-generating a new trie with the matching alphabet yields proper results.
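
For anyone who lands here later: the trie is built from the LM and alphabet.txt together, so it must be rebuilt whenever alphabet.txt changes. A hedged sketch of the rebuild step (generate_trie ships with native_client; the argument order below is my assumption for ~0.5-era releases, and older ones also took a vocabulary file, so check the README for your version):

```python
# Rebuild the trie so it matches the current alphabet.txt.
# Argument order is an assumption for ~0.5-era native_client builds.
import subprocess

subprocess.run(["./generate_trie", "alphabet.txt", "lm.binary", "trie"], check=True)
```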

You're absolutely right. I think this should be put in the README, for clarity.

Crazy good catch!