ValueError

Thanks for your reply, and I will try again later!

zhangpeng_K hello, I am still dealing with the alphabet.txt issue and import_CV2.PY. As soon as I overcome this issue, I will let you know.

Firstly, as @lissyx said, you must ensure your ds version and ds_ctcdecoder version are the same. Then, if you add new characters to alphabet.txt, you have to rebuild words.arpa and lm.binary with the KenLM tools, then use generate_trie to generate the trie binary, and finally update the training shell script with the paths of the files you have updated. The following is my environment:
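The rebuild steps above can be sketched as a short shell session. The tool locations, file names, and flags below are assumptions based on the 0.6.x data/lm instructions, not a verified recipe; adjust the paths for your own checkout:

```shell
# 1. Rebuild the n-gram model from the updated vocabulary (assumed file: vocabulary.txt)
kenlm/build/bin/lmplz --order 5 --text vocabulary.txt --arpa words.arpa

# 2. Convert the ARPA file into the compact binary format the decoder loads
kenlm/build/bin/build_binary -a 255 -q 8 trie words.arpa lm.binary

# 3. Regenerate the trie against the *updated* alphabet
#    (generate_trie ships with the native_client for your DeepSpeech version)
./generate_trie alphabet.txt lm.binary trie
```

The key point is step 3: the trie is built against a specific alphabet.txt, so any alphabet change forces all three artifacts to be regenerated together.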
(deepspeech-venv) (base) zhangp@zhangp:~/tmp/deepspeech-venv$ pip list
Package Version


absl-py 0.9.0
astor 0.8.1
attrdict 2.0.1
audioread 2.1.8
beautifulsoup4 4.8.2
bs4 0.0.1
certifi 2019.11.28
cffi 1.14.0
chardet 3.0.4
decorator 4.4.2
deepspeech-gpu 0.6.1
ds-ctcdecoder 0.6.1
gast 0.2.2
google-pasta 0.1.8
grpcio 1.27.2
h5py 2.10.0
idna 2.9
joblib 0.14.1
Keras-Applications 1.0.8
Keras-Preprocessing 1.1.0
librosa 0.7.2
llvmlite 0.31.0
Markdown 3.2.1
mock 4.0.2
numba 0.48.0
numpy 1.18.1
opt-einsum 3.1.0
pandas 1.0.1
pip 20.0.2
pkg-resources 0.0.0
progressbar2 3.47.0
protobuf 3.11.3
pycparser 2.19
python-dateutil 2.8.1
python-utils 2.3.0
pytz 2019.3
pyxdg 0.26
requests 2.23.0
resampy 0.2.2
scikit-learn 0.22.2
scipy 1.4.1
semver 2.9.1
setuptools 45.2.0
six 1.14.0
SoundFile 0.10.3.post1
soupsieve 2.0
sox 1.3.7
tensorboard 1.15.0
tensorflow-estimator 1.15.1
tensorflow-gpu 1.15.0
termcolor 1.1.0
urllib3 1.25.8
webrtcvad 2.0.10
Werkzeug 1.0.0
wheel 0.34.2
wrapt 1.12.0

Hello @Stanislavs_Davidovics: I don't use import_CV2.py to resolve alphabet.txt, I just follow the tutorial and then write my own code to generate an alphabet.txt in the same format as the official one. Now I can train my own model. If you have some experience, we can communicate any time, thanks!
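A minimal sketch of the kind of script meant here: derive alphabet.txt (one character per line) from the training transcripts. The CSV path and the `transcript` column name are assumptions; adjust them to your dataset layout:

```python
# Build an alphabet.txt (one character per line) from training transcripts.
# The CSV column name "transcript" is an assumption; adjust to your data.
import csv

def build_alphabet(csv_path, out_path, transcript_col="transcript"):
    """Collect every character used in the transcripts and write them sorted."""
    chars = set()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            chars.update(row[transcript_col])
    with open(out_path, "w", encoding="utf-8") as f:
        for c in sorted(chars):
            f.write(c + "\n")
    return sorted(chars)
```

Sorting the characters keeps the file deterministic, which matters because the acoustic model's output indices are tied to line positions in alphabet.txt.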

Finally solved it, by rebuilding the language model .arpa and .binary files and the trie with the modified alphabet file, and then using the newly built lm.binary and trie to train the model.
It worked. Thanks.
*I used DeepSpeech 0.6.1 and set up the KenLM build using Python.

Hi lissyx, I need a more detailed clarification about your reply above:

  1. Can you point me to the document you mention above?
  2. When you say “need to retrain from scratch”, what is it that needs to be retrained from scratch? And how do I go about this?

Thanks

Hi Zhangpeng_K,

Can you explain the following:

  1. How do you rebuild words.arpa and lm.binary with the KenLM tools?
  2. Why do you need to rebuild words.arpa and lm.binary?
  3. Why do you still need to generate the trie binary when you have the kenlm.scorer package?

Thanks and appreciate your insights!

https://deepspeech.readthedocs.io/en/v0.6.1/TRAINING.html

Same, this is all in TRAINING doc and data/lm

I went through DeepSpeech’s “Training your own model” and data/lm’s README files many times, but they don’t give any explanation as to why you need to build the .arpa and .binary files. The DeepSpeech training section is also silent on this part. How do you link those two pieces of the guide together to get a better understanding of what’s going on when you have a new alphabet to add?

Decoding requires knowing the alphabet. Your vocabulary gets translated into the lm.binary file, so changing the alphabet means your lm.binary is invalid.
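One way to see the dependency: every character the decoder can emit must exist in the alphabet, so a vocabulary that uses characters outside alphabet.txt cannot be decoded against it. A small sanity check (a sketch, not part of DeepSpeech itself) might look like:

```python
# Sanity check: every character used in the LM vocabulary must appear in
# alphabet.txt, otherwise decoding against that LM cannot work.
def missing_chars(vocabulary_lines, alphabet_chars):
    """Return characters in the vocabulary that the alphabet does not cover."""
    allowed = set(alphabet_chars)
    used = set()
    for line in vocabulary_lines:
        used.update(line.strip())
    return sorted(used - allowed)
```

If this returns anything, the alphabet and the language model are out of sync and the LM artifacts need to be rebuilt.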

Where exactly is it mentioned in https://deepspeech.readthedocs.io/en/v0.6.1/TRAINING.html that you need to create new words.arpa and lm.binary files?

Can we avoid getting the topic in three different directions here? Rebuilding the LM mentions generating the trie file.

PLEASE PLEASE PLEASE, IF YOU FIND THE DOC INACCURATE, FILE ISSUES AND EXPLAIN WHAT YOU DON’T UNDERSTAND / MISS.

WE CAN’T GET INTO YOUR HEAD.

Agree. I edited my answer.

When reading https://deepspeech.readthedocs.io/en/v0.6.1/TRAINING.html I could not find any reference to words.arpa and lm.binary; maybe this should be added, because I only found out about them on different web pages.

Well, it is obvious to us, and to many other people who have not had any problem doing their training, that you need to build your own language model. So please file an issue on GitHub explaining exactly what you missed or how you would word it. Better still, send a patch adding the missing documentation.

Let me repeat: we cannot get into your head. We are so deep into the project that some things are so obvious to us that we can’t even know why they are complicated for others. This is not arrogance or pushback.

If people don’t tell us that they don’t understand the current doc or else, we can’t improve.

Hi lissyx,

Does lm.binary apply to any language?
The reason I’m asking is that I’m working on the zh-HK and id languages. How would running python generate_lm.py distinguish which language I’m working on?

Thanks

Yes. Please look at the doc explaining the acoustic / language model. You need an LM to perform decoding, whatever the language.

That question makes no sense to me. This code does not care about your language; it just builds a file that is used by the decoder to help decode the acoustic output.
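To illustrate why the tooling is language-agnostic: a language model is just statistics over token sequences, and nothing in that counting depends on which human language the tokens come from. A toy sketch of the idea (not generate_lm.py itself, which delegates the real work to KenLM):

```python
# A language model is built from counts over token sequences; the counting
# logic below is identical whatever language the tokens come from.
from collections import Counter

def bigram_counts(sentences):
    """Count adjacent token pairs across whitespace-tokenized sentences."""
    counts = Counter()
    for s in sentences:
        tokens = s.split()
        for a, b in zip(tokens, tokens[1:]):
            counts[(a, b)] += 1
    return counts
```

KenLM does the same kind of counting (with higher orders, smoothing, and a compact binary format), which is why the one pipeline serves English, zh-HK, id, or any other language whose text you feed it.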

Where is this document on acoustic/language model?

And thanks for the reply on

It’s everywhere… in the original paper, etc.

@CheahHeng_Tan It seems you are searching for a lot of links / references. Can you please explain what you are working on?