Training fails with an error about characters

Hi, I got this error when my training job was nearly finished:


Decoding predictions…

[ctc_beam_search_decoder.cpp:29] FATAL: "(class_dim) == (alphabet.GetSize()+1)" check failed. The shape of probs does not match with the shape of the vocabulary

Please help me to resolve this error.

The error message is pretty obvious. Please share more details on your training.

This is the list of characters in my alphabet.txt (it already contains the whitespace character ' '):
a
á
à
ả
ã
ạ
ă
ắ
ằ
ẳ
ẵ
ặ
â
ấ
ầ
ẩ
ẫ
ậ
b
c
d
đ
e
é
è
ẻ
ẽ
ẹ
ê
ế
ề
ể
ễ
ệ
g
h
i
í
ì
ỉ
ĩ
ị
k
l
m
n
o
ó
ò
õ
ỏ
ọ
ô
ố
ồ
ổ
ỗ
ộ
ơ
ớ
ờ
ở
ỡ
ợ
p
q
r
s
t
u
ú
ù
ủ
ũ
ụ
ư
ứ
ừ
ử
ữ
ự
v
x
y
ý
ỳ
ỷ
ỹ
ỵ

And these are all the characters in my 3 datasets, as reported by util/check_characters.py:
train.csv:
[β€˜Ζ‘β€™, β€˜α»…β€™, β€˜α»ƒβ€™, β€˜Γ¨β€™, β€˜y’, β€˜αΊ·β€™, β€˜Ζ°β€™, β€˜α»©β€™, β€˜αΊ§β€™, β€˜αΊ«β€™, β€˜αΊ©β€™, β€˜b’, β€˜u’, β€˜α»β€™, β€˜Γ’β€™, β€˜m’, β€˜αΊ³β€™, β€˜α»β€™, β€˜αΊΏβ€™, β€˜α»‘β€™, β€˜α»±β€™, β€˜c’, β€˜α»—β€™, β€˜α»•β€™, β€˜α»­β€™, β€˜α»›β€™, β€˜Γ­β€™, β€˜t’, β€˜αΊ£β€™, β€˜α»β€™, β€˜αΊ΅β€™, β€˜Δ‘β€™, β€˜α»·β€™, β€˜α»«β€™, β€˜α»™β€™, β€˜α»‹β€™, β€˜αΊΉβ€™, β€˜α»‰β€™, β€˜α»Ÿβ€™, β€˜g’, β€˜α»£β€™, β€˜ΓΉβ€™, β€˜α»“β€™, β€˜α»§β€™, β€˜Γ³β€™, β€˜αΊ―β€™, β€˜Γ½β€™, β€˜Δƒβ€™, ’ ', β€˜e’, β€˜a’, β€˜Γ£β€™, β€˜α»‡β€™, β€˜Γ²β€™, β€˜n’, β€˜αΊ»β€™, β€˜αΊ½β€™, β€˜Γͺ’, β€˜Γ¬β€™, β€˜i’, β€˜αΊ±β€™, β€˜αΊ₯’, β€˜αΊ‘β€™, β€˜r’, β€˜α»―β€™, β€˜k’, β€˜α»Ήβ€™, β€˜Ε©β€™, β€˜Γ‘β€™, β€˜d’, β€˜α»³β€™, β€˜αΊ­β€™, β€˜o’, β€˜s’, β€˜α»β€™, β€˜Γ΄β€™, β€˜p’, β€˜Δ©β€™, β€˜Γ β€™, β€˜Γ΅β€™, β€˜h’, β€˜q’, β€˜α»₯’, β€˜α»΅β€™, β€˜Γ©β€™, β€˜v’, β€˜ΓΊβ€™, β€˜x’, β€˜l’, β€˜α»‘β€™]

dev.csv:
[β€˜x’, β€˜v’, β€˜Δƒβ€™, β€˜α»΅β€™, β€˜c’, β€˜αΊ₯’, β€˜αΊ³β€™, β€˜h’, β€˜αΊ­β€™, β€˜α»‘β€™, β€˜αΊ©β€™, β€˜α»—β€™, β€˜α»β€™, β€˜αΊ½β€™, β€˜Γ©β€™, β€˜α»©β€™, β€˜αΊ§β€™, β€˜s’, β€˜Ζ°β€™, β€˜αΊ±β€™, β€˜αΊ΅β€™, β€˜Γ΄β€™, β€˜α»­β€™, β€˜ΓΉβ€™, β€˜Ε©β€™, β€˜α»―β€™, β€˜Δ©β€™, β€˜r’, β€˜αΊ·β€™, β€˜g’, β€˜αΊ―β€™, β€˜α»±β€™, ’ ', β€˜p’, β€˜α»§β€™, β€˜α»‘β€™, β€˜y’, β€˜Γͺ’, β€˜b’, β€˜α»«β€™, β€˜α»β€™, β€˜Γ β€™, β€˜αΊΏβ€™, β€˜α»™β€™, β€˜α»Ÿβ€™, β€˜α»›β€™, β€˜q’, β€˜Γ­β€™, β€˜α»…β€™, β€˜αΊ‘β€™, β€˜α»β€™, β€˜α»Ήβ€™, β€˜d’, β€˜Γ’β€™, β€˜Γ¬β€™, β€˜α»·β€™, β€˜α»‡β€™, β€˜Γ²β€™, β€˜l’, β€˜Γ¨β€™, β€˜o’, β€˜Γ³β€™, β€˜Δ‘β€™, β€˜α»“β€™, β€˜Γ‘β€™, β€˜t’, β€˜u’, β€˜ΓΊβ€™, β€˜α»‹β€™, β€˜α»³β€™, β€˜α»ƒβ€™, β€˜αΊΉβ€™, β€˜αΊ«β€™, β€˜αΊ»β€™, β€˜α»£β€™, β€˜α»‰β€™, β€˜αΊ£β€™, β€˜Ζ‘β€™, β€˜m’, β€˜n’, β€˜Γ£β€™, β€˜α»•β€™, β€˜α»₯’, β€˜k’, β€˜Γ½β€™, β€˜e’, β€˜α»β€™, β€˜Γ΅β€™, β€˜i’, β€˜a’]

test.csv:
[β€˜αΊΏβ€™, β€˜Δ‘β€™, β€˜α»•β€™, β€˜Δ©β€™, β€˜l’, β€˜αΊ£β€™, β€˜αΊ­β€™, β€˜αΊΉβ€™, β€˜αΊ‘β€™, β€˜Γ½β€™, β€˜α»«β€™, β€˜α»“β€™, β€˜αΊ«β€™, β€˜αΊ©β€™, β€˜α»—β€™, β€˜t’, β€˜αΊ·β€™, β€˜αΊ₯’, β€˜α»Ÿβ€™, β€˜p’, β€˜αΊ½β€™, β€˜Γ­β€™, β€˜α»±β€™, β€˜x’, β€˜αΊ±β€™, β€˜s’, β€˜α»‘β€™, β€˜b’, β€˜α»Ήβ€™, β€˜a’, β€˜α»­β€™, β€˜α»§β€™, β€˜y’, β€˜αΊ»β€™, β€˜α»‹β€™, β€˜α»β€™, β€˜α»β€™, β€˜e’, β€˜Γ‘β€™, β€˜α»β€™, β€˜α»©β€™, β€˜αΊ―β€™, β€˜α»―β€™, β€˜k’, β€˜Γ’β€™, β€˜Ζ°β€™, β€˜n’, β€˜Γ΅β€™, β€˜Γ©β€™, β€˜α»™β€™, β€˜m’, β€˜α»·β€™, β€˜d’, β€˜α»‰β€™, β€˜r’, β€˜Γ΄β€™, β€˜α»£β€™, ’ ', β€˜c’, β€˜v’, β€˜Γ£β€™, β€˜α»‘β€™, β€˜Ζ‘β€™, β€˜u’, β€˜Ε©β€™, β€˜ΓΉβ€™, β€˜g’, β€˜α»₯’, β€˜Γ³β€™, β€˜Γ¨β€™, β€˜α»›β€™, β€˜Δƒβ€™, β€˜Γͺ’, β€˜o’, β€˜α»…β€™, β€˜α»³β€™, β€˜Γ¬β€™, β€˜Γ²β€™, β€˜αΊ³β€™, β€˜αΊ§β€™, β€˜α»β€™, β€˜ΓΊβ€™, β€˜q’, β€˜i’, β€˜h’, β€˜Γ β€™, β€˜α»ƒβ€™, β€˜α»‡β€™]
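For anyone comparing these lists by hand: below is a rough, hypothetical sketch of the same check util/check_characters.py performs, written as a standalone script. It assumes the standard DeepSpeech CSV layout with a `transcript` column and an alphabet.txt where lines starting with `#` are comments; adjust the column name if your CSVs differ.

```python
import csv
import sys

def dataset_characters(csv_path):
    """Collect the set of unique characters in the transcript column of a CSV."""
    chars = set()
    with open(csv_path, encoding="utf-8") as f:
        for row in csv.DictReader(f):
            chars.update(row["transcript"])  # assumes a 'transcript' column
    return chars

def alphabet_characters(alphabet_path):
    """Read alphabet.txt: one character per line; lines starting with '#' are comments."""
    with open(alphabet_path, encoding="utf-8") as f:
        return {line.rstrip("\n") for line in f if not line.startswith("#")}

if __name__ == "__main__":
    # Hypothetical usage: python compare_chars.py alphabet.txt train.csv dev.csv test.csv
    alphabet = alphabet_characters(sys.argv[1])
    data = set()
    for path in sys.argv[2:]:
        data |= dataset_characters(path)
    print("in datasets but missing from alphabet.txt:", sorted(data - alphabet))
    print("in alphabet.txt but unused in datasets:", sorted(alphabet - data))
```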

@lissyx If you need anything else about my training setup to resolve this, just tell me :smiley:


Well, have you checked the dimensions?

I don't know how to check it. Can you help me? :smiley:

The shape of probs does not match with the shape of the vocabulary

So you need to check the size of your output layer against the size of your alphabet.

So, again, please explain in more detail what you are doing. What's the model, training dataset, language model, etc.?
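To make that check concrete: the assertion in the decoder says class_dim (the last dimension of the probability matrix produced by the network) must equal alphabet size + 1, where the extra label is the CTC blank. A minimal sketch of the comparison, assuming alphabet.txt uses one character per line with '#'-prefixed comment lines:

```python
def alphabet_size(alphabet_path):
    """Count the labels in alphabet.txt (every non-comment line is one label)."""
    with open(alphabet_path, encoding="utf-8") as f:
        return sum(1 for line in f if not line.startswith("#"))

size = alphabet_size("alphabet.txt")
print("alphabet size:", size)
print("decoder expects class_dim =", size + 1)  # +1 for the CTC blank label
# Compare this value with the last dimension of the probs/logits tensor,
# e.g. by printing probs.shape[-1] just before the decoding step in evaluate.py.
```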

Thank you @lissyx, I have resolved this error by creating a new alphabet.txt file. The previous one had an error somewhere: I expected it to contain only 90 characters, but when I printed the alphabet in evaluate.py during the decoding step it was actually 91.
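For anyone hitting the same off-by-one: a single stray blank line or a duplicated character in alphabet.txt is enough to turn 90 labels into 91. A quick sanity check, again assuming '#' marks comment lines:

```python
from collections import Counter

with open("alphabet.txt", encoding="utf-8") as f:
    labels = [line.rstrip("\n") for line in f if not line.startswith("#")]

print("label count:", len(labels))
print("empty lines:", labels.count(""))
print("duplicates:", [c for c, n in Counter(labels).items() if n > 1])
```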