About Chinese Pinyin model

Does anyone have experience training a model on a pinyin dataset such as THCHS-30 (https://www.openslr.org/18/)? I have a problem with the alphabet set and the language model.
I tried the officially released alphabet set and formatted it as in the picture below,


but it does not work. It always gives feedback like . The transcript entry for A22_50.wav is:
/DeepSpeech/cn_corpus/THCHS_30/data_thchs30/train/A22_50.wav,144044,ii i1 x iang1 g ang3 d e5 k ang4 r iz4 j iu4 uu uang2 vv vn4 d ong4 ii i3 j iu4 uu uang2 uu un2 h ua4 vv vn4 d ong4 uu ui2 zh u3 ii iao4 x ing2 sh ix4
and it does not contain a single ‘i’ on its own.
Has anybody encountered a problem like this?

Hi,

Is your problem solved? My assumption is that the file itself is the cause. Which operating system do you use? If it is Linux, I think you will have to change the line endings to ‘Unix’ (Sublime Text can do that easily).
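If you would rather script the line-ending change than open each file in an editor, something like the following should do it. This is only a minimal sketch: the helper names `to_unix_line_endings` and `convert_file` are made up for illustration and are not part of DeepSpeech.

```python
def to_unix_line_endings(data: bytes) -> bytes:
    """Replace Windows (CRLF) and old Mac (CR) line endings with LF."""
    # Handle CRLF first so the lone-CR pass does not double the newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(path: str) -> None:
    """Rewrite the file at `path` in place with Unix line endings."""
    # Work on raw bytes so the text encoding (e.g. UTF-8) is left untouched.
    with open(path, "rb") as f:
        data = f.read()
    with open(path, "wb") as f:
        f.write(to_unix_line_endings(data))
```

You would call `convert_file` on the alphabet file and on each transcript CSV; keep a backup first, since the file is overwritten in place.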

BTW, if your problem is solved, how is the performance? I am preparing to train a model with pinyin as input as well.

This is not how it works.
If you trace the code down to “alphabet.cc”, you can look at the function “CanEncode()” to see how it is designed.

alphabets.txt shouldn’t contain more than one Unicode code point on each line.
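To see why a lone ‘i’ can be reported even though the transcript has no standalone ‘i’, here is a rough Python sketch of a per-code-point check. It is not the actual code from alphabet.cc; it only assumes, as described above, that each alphabet line is a single code point and that transcripts are checked one character at a time. The function names are invented for this example, and skipping lines that start with ‘#’ as comments is also an assumption.

```python
def multi_codepoint_entries(lines):
    """Return (line_number, entry) for alphabet entries longer than one code point."""
    bad = []
    for number, line in enumerate(lines, start=1):
        entry = line.rstrip("\r\n")
        if entry.startswith("#"):
            continue  # assumption: lines starting with '#' are comments
        if len(entry) > 1:
            bad.append((number, entry))
    return bad


def missing_characters(transcript, lines):
    """Return the sorted characters of `transcript` that are not alphabet entries."""
    alphabet = {line.rstrip("\r\n") for line in lines}
    # The transcript is walked character by character, not token by token.
    return sorted({ch for ch in transcript if ch not in alphabet})


# A pinyin-token alphabet, one multi-character token per line.
alphabet_lines = ["ii\n", "i1\n", "x\n", "iang1\n"]
flagged = multi_codepoint_entries(alphabet_lines)
missing = missing_characters("ii i1 x iang1", alphabet_lines)
```

With an alphabet like this, `flagged` lists every multi-character token, and `missing` includes ‘i’ because the token “ii” is looked up as two separate ‘i’ characters, neither of which is an alphabet entry.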