Hello
I am running some tests with a Portuguese DeepSpeech model that I trained.
I built a large vocabulary.txt, modeled on the English version, but my file also contains some symbols. When I ran generate_lm.py, I got this error:
Converting to lowercase and counting word occurrences ...
| | # | 23838383 Elapsed Time: 0:14:29
Saving top 500000 words ...
Calculating word statistics ...
Your text file has 603695917 words in total
It has 7083721 unique words
Your top-500000 words are 97.6812 percent of all words
Your most common word "de" occurred 29191195 times
The least common word in your top-k is "super-heroínas;" with 16 times
The first word with 17 occurrences is "✤" at place 488285
Creating ARPA file ...
=== 1/5 Counting and sorting n-grams ===
Reading /content/sample_data/DeepSpeech/data/lm/lower.txt.gz
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
tcmalloc: large alloc 2043600896 bytes == 0x56449cc22000 @ 0x7faf03e611e7 0x56449a48f7a2 0x56449a42a51e 0x56449a4092eb 0x56449a3f5066 0x7faf01ffabf7 0x56449a3f6baa
tcmalloc: large alloc 9536798720 bytes == 0x564516910000 @ 0x7faf03e611e7 0x56449a48f7a2 0x56449a47e7ca 0x56449a47f208 0x56449a409308 0x56449a3f5066 0x7faf01ffabf7 0x56449a3f6baa
**********/content/sample_data/DeepSpeech/kenlm/lm/builder/corpus_count.cc:179 in void lm::builder::{anonymous}::ComplainDisallowed(StringPiece, lm::WarningAction&) threw FormatLoadException.
Special word <s> is not allowed in the corpus. I plan to support models containing <unk> in the future. Pass --skip_symbols to convert these symbols to whitespace.
Traceback (most recent call last):
File "generate_lm.py", line 210, in <module>
main()
File "generate_lm.py", line 201, in main
build_lm(args, data_lower, vocab_str)
File "generate_lm.py", line 97, in build_lm
subprocess.check_call(subargs)
File "/usr/lib/python3.6/subprocess.py", line 311, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/content/sample_data/DeepSpeech/kenlm/cmake/bin/lmplz', '--order', '5', '--temp_prefix', '.', '--memory', '85%', '--text', './lower.txt.gz', '--arpa', './lm.arpa', '--prune', '0', '0', '1', '--discount_fallback']' died with <Signals.SIGABRT: 6>.
So I tried passing --skip_symbols as a flag to generate_lm.py, but it seems the script doesn't accept this flag. How can I deal with this?
Being able to skip the symbols would help me a lot.
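
One possible workaround, sketched here under assumptions rather than taken from the DeepSpeech scripts, is to clean the corpus before generate_lm.py sees it. The error message says lmplz rejects the special word <s>; the sketch assumes the same applies to </s> and <unk>, and replaces each of them with whitespace when it appears as a standalone word. The names `RESERVED`, `strip_reserved` and `clean_corpus` are hypothetical helpers, not part of DeepSpeech or KenLM.

```python
import re

# Assumption: these are the tokens lmplz refuses to see in the corpus.
RESERVED = ("<s>", "</s>", "<unk>")

# Match a reserved token only when it is a whole whitespace-delimited word,
# so a string such as "a<s>b" is left alone.
_RESERVED_RE = re.compile(
    r"(?<!\S)(?:" + "|".join(re.escape(tok) for tok in RESERVED) + r")(?!\S)"
)


def strip_reserved(line):
    """Replace reserved tokens with a space, then collapse extra whitespace."""
    return " ".join(_RESERVED_RE.sub(" ", line).split())


def clean_corpus(src_path, dst_path):
    """Write a copy of the text file at src_path with reserved tokens removed."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(strip_reserved(line) + "\n")
```

Running `clean_corpus("vocabulary.txt", "vocabulary_clean.txt")` and pointing generate_lm.py at the cleaned file should avoid the exception if those tokens are the only cause. Alternatively, since the error message says lmplz itself accepts --skip_symbols, adding that string to the lmplz argument list built in build_lm may have the same effect, though I have not verified this against the script.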