I will check out that tag and will let you know.
Now I have matched versions:
pip list | grep ds_ctcdecoder
ds-ctcdecoder 0.6.1
git describe --tags
v0.6.1
Also, after generating the LM, I am getting this error while training from the checkpoint with my added vocabulary, LM binary, and trie.
Command:
python3 DeepSpeech.py \
--train_files /home/Downloads/indian_train.csv \
--dev_files /home/Downloads/indian_dev.csv \
--test_files /home/Downloads/indian_test.csv \
--n_hidden 2048 \
--train_batch_size 20 \
--dev_batch_size 10 \
--test_batch_size 10 \
--epochs 1 \
--learning_rate 0.0001 \
--export_dir /home/Desktop/mark3/trieModel/ \
--checkpoint_dir /home/Desktop/mark3/DeepSpeech/deepspeech-0.6.1-checkpoint/ \
--cudnn_checkpoint /home/Desktop/mark3/DeepSpeech/deepspeech-0.6.1-checkpoint/ \
--alphabet_config_path /home/Desktop/mark3/mfit-models/alphabet.txt \
--lm_binary_path /home/Desktop/mark3/mfit-models/lm.binary \
--lm_trie_path /home/Desktop/mark3/mfit-models/trie \
The error comes after training ends and the dev phase completes, while it is evaluating test.csv:
I Restored variables from best validation checkpoint at /home/Desktop/mark3/DeepSpeech/deepspeech-0.6.1-checkpoint/best_dev-234353, step 234353
Testing model on /home/Downloads/indian_test.csv
Test epoch | Steps: 0 | Elapsed Time: 0:00:00 Fatal Python error: Fatal Python error: Fatal Python error: Fatal Python error: Segmentation faultSegmentation faultSegmentation fault
Segmentation faultThread 0x
Segmentation fault (core dumped)
There's something wrong in your CTC decoder setup / trie production…
Can you please tell us exactly how you proceed? I'm really starting to lose patience here.
What are the sizes of:
- your vocabulary.txt file
- your lm.binary file
- your trie file
Can you ensure you used exactly the same alphabet file?
Maybe there's something bogus in your dataset.
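One way to answer the "same alphabet" question definitively is to byte-compare the alphabet file used for trie generation with the one passed to --alphabet_config_path. A minimal sketch; the helper name and example paths are assumptions for illustration, not part of DeepSpeech:

```shell
# check_alphabets FILE1 FILE2
# Prints whether two alphabet files are byte-identical. generate_trie,
# training, and inference must all see exactly the same alphabet.
check_alphabets() {
    if cmp -s "$1" "$2"; then
        echo "alphabets match"
    else
        echo "alphabets differ: regenerate the trie"
    fi
}

# Example (paths assumed from this thread):
# check_alphabets /home/Desktop/mark3/mfit-models/alphabet.txt ../data/alphabet.txt
```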
vocabulary.txt = 1.7MB
lm.binary = 20.1MB
trie = 80 Bytes
So you failed at generating the trie file. Since you have not yet shared how you do that, we can't help you…
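An 80-byte trie is little more than a file header, so a size sanity check right after generation catches this kind of failure early. A sketch; the helper and the threshold are arbitrary assumptions of mine, not a DeepSpeech rule:

```shell
# warn_if_tiny FILE MIN_BYTES
# Flags output files that are suspiciously small, e.g. a trie of only
# 80 bytes produced from a 1.7 MB vocabulary.
warn_if_tiny() {
    size=$(wc -c < "$1")
    size=$((size))   # normalize: some wc implementations pad with spaces
    if [ "$size" -lt "$2" ]; then
        echo "$1 is only $size bytes: generation probably failed"
    else
        echo "$1 looks plausible ($size bytes)"
    fi
}

# Example (1 kB threshold is a guess for a vocabulary this size):
# warn_if_tiny trie 1024
```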
I took the generate_trie binary from native_client.amd64.cpu.linux.tar.xz and ran ./generate_trie ../data/alphabet.txt lm.binary trie to generate the trie.
So that's not the alphabet you are using for the training?!
--alphabet_config_path /home/Desktop/mark3/mfit-models/alphabet.txt
That does not look like the same path as ../data/alphabet.txt…
I am sorry, I used the same one, i.e. /home/Desktop/mark3/mfit-models/alphabet.txt; I pasted the other path here by mistake.
Also, I checked the size of the default lm.binary in ./data/lm: it's 945 MB. Is there some issue with mine being only 20 MB?
The LM size depends on your vocabulary file size, so it might be normal if yours is small.
Please, can we avoid constant round trips and get a clear view at once? Share the exact and accurate command lines, as well as ls output for each of the involved files…
(mark3) root@computer:/home/computer/Desktop/mark3# ls
customModels DeepSpeech indianModel kenlm mfit-models namesModel tensorflow trieModel
(mark3) root@computer:/home/computer/Desktop/mark3/mfit-models# ../kenlm/build/bin/lmplz --discount_fallback --text mirrorfit.txt --arpa words.arpa --o 3
=== 1/5 Counting and sorting n-grams ===
Reading /home/computer/Desktop/mark3/mfit-models/mirrorfit.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 200003 types 200006
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:2400072 2:9327010816 3:17488144384
Substituting fallback discounts for order 0: D1=0.5 D2=1 D3+=1.5
Substituting fallback discounts for order 1: D1=0.5 D2=1 D3+=1.5
Substituting fallback discounts for order 2: D1=0.5 D2=1 D3+=1.5
Statistics:
1 200006 D1=0.5 D2=1 D3+=1.5
2 400006 D1=0.5 D2=1 D3+=1.5
3 200003 D1=0.5 D2=1 D3+=1.5
Memory estimate for binary LM:
type kB
probing 17969 assuming -p 1.5
probing 21094 assuming -r models -p 1.5
trie 10718 without quantization
trie 7864 assuming -q 8 -b 8 quantization
trie 10132 assuming -a 22 array pointer compression
trie 7278 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:2400072 2:6400096 3:4000060
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:2400072 2:6400096 3:4000060
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Name:lmplz VmPeak:26372372 kB VmRSS:22700 kB RSSMax:6075336 kB user:0.876577 sys:1.34088 CPU:2.21748 real:2.16143
(mark3) root@computer:/home/computer/Desktop/mark3/mfit-models# ../kenlm/build/bin/build_binary -T -s words.arpa lm.binary
Reading words.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS
The command below doesn't give any output; it just creates the trie file:
(mark3) root@computer:/home/computer/Desktop/mark3/mfit-models# ../DeepSpeech/generate_trie alphabet.txt lm.binary trie
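generate_trie being silent on success makes failures easy to miss; its exit status is the only immediate signal. A small wrapper sketch (the helper is my own invention; only the generate_trie invocation in the comment comes from this thread):

```shell
# run_step CMD [ARGS...]
# Runs a command and reports its exit status; useful for tools like
# generate_trie that print nothing on success.
run_step() {
    "$@"
    status=$?
    if [ "$status" -eq 0 ]; then
        echo "ok: $*"
    else
        echo "failed (status $status): $*"
    fi
    return "$status"
}

# Example with the command from this thread, then a size check:
# run_step ../DeepSpeech/generate_trie alphabet.txt lm.binary trie
# ls -hal trie
```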
DeepSpeech directory:
(mark3) root@computer:/home/computer/Desktop/mark3/DeepSpeech# ls
bazel.patch DeepSpeech.py libdeepspeech.so requirements.txt
bin doc LICENSE runNameTrieModel.sh
build-python-wheel.yml-DISABLED_ENABLE_ME_TO_REBUILD_DURING_PR Dockerfile myDataset stats.py
CODE_OF_CONDUCT.md evaluate.py native_client SUPPORT.rst
CONTRIBUTING.rst evaluate_tflite.py native_client.amd64.cpu.linux.tar.xz taskcluster
data examples __pycache__ transcribe.py
deepspeech generate_trie README.mozilla util
deepspeech-0.6.1-checkpoint GRAPH_VERSION README.rst VERSION
deepspeech-0.6.1-checkpoint.tar.gz images RELEASE.rst
deepspeech.h ISSUE_TEMPLATE.md requirements_eval_tflite.txt
Please let me know if I forgot to mention anything.
Use ls -hal, otherwise it's useless.
I still don't know your alphabet, vocabulary, and new trie file sizes…
I'm pretty sure you lack a trie command-line parameter here.
(mark3) root@computer:/home/computer/Desktop/mark3/DeepSpeech# ls -hal
total 657M
drwxr-xr-x 15 root root 4.0K Feb 6 11:10 .
drwxrwxr-x 10 computer computer 4.0K Feb 6 10:36 ..
-rw-r--r-- 1 root root 11K Feb 5 15:51 bazel.patch
drwxr-xr-x 2 root root 4.0K Feb 5 15:51 bin
-rw-r--r-- 1 root root 173 Feb 5 15:51 build-python-wheel.yml-DISABLED_ENABLE_ME_TO_REBUILD_DURING_PR
-rw-r--r-- 1 root root 60 Feb 5 15:51 .cardboardlint.yml
-rw-r--r-- 1 root root 691 Feb 5 15:51 CODE_OF_CONDUCT.md
-rwxr-xr-x 1 root root 933 Feb 5 15:51 .compute
-rw-r--r-- 1 root root 2.1K Feb 5 15:51 CONTRIBUTING.rst
drwxr-xr-x 5 root root 4.0K Feb 5 15:51 data
-rwxr-xr-x 1 syslog Unix Group\nogroup 892K Jan 10 22:47 deepspeech
drwxr-xr-x 2 501 staff 4.0K Feb 6 14:17 deepspeech-0.6.1-checkpoint
-rw-rw-r-- 1 computer computer 613M Jan 23 17:35 deepspeech-0.6.1-checkpoint.tar.gz
-rw-r--r-- 1 syslog Unix Group\nogroup 8.4K Jan 10 22:45 deepspeech.h
-rwxr-xr-x 1 root root 42K Feb 5 15:51 DeepSpeech.py
drwxr-xr-x 3 root root 4.0K Feb 5 15:51 doc
-rw-r--r-- 1 root root 6.5K Feb 5 15:51 Dockerfile
-rwxr-xr-x 1 root root 6.8K Feb 5 15:51 evaluate.py
-rw-r--r-- 1 root root 4.6K Feb 5 15:51 evaluate_tflite.py
drwxr-xr-x 2 root root 4.0K Feb 5 15:51 examples
-r-xr-xr-x 1 syslog Unix Group\nogroup 2.0M Jan 10 22:47 generate_trie
drwxr-xr-x 9 root root 4.0K Feb 5 16:47 .git
-rw-r--r-- 1 root root 148 Feb 5 15:51 .gitattributes
drwxr-xr-x 2 root root 4.0K Feb 5 15:51 .github
-rw-r--r-- 1 root root 474 Feb 5 15:51 .gitignore
-rw-r--r-- 1 root root 123 Feb 5 15:51 .gitmodules
-rw-r--r-- 1 root root 2 Feb 5 15:51 GRAPH_VERSION
drwxr-xr-x 2 root root 4.0K Feb 5 15:51 images
-rw-r--r-- 1 root root 1.2K Feb 5 15:51 ISSUE_TEMPLATE.md
-r-xr-xr-x 1 syslog Unix Group\nogroup 34M Jan 10 22:47 libdeepspeech.so
-rw-r--r-- 1 syslog Unix Group\nogroup 17K Jan 10 22:45 LICENSE
drwxr-xr-x 3 computer computer 4.0K Jan 29 11:11 myDataset
drwxr-xr-x 9 root root 4.0K Feb 5 15:51 native_client
-rw-rw-r-- 1 computer computer 6.5M Feb 6 10:21 native_client.amd64.cpu.linux.tar.xz
drwxr-xr-x 2 root root 4.0K Feb 6 10:36 __pycache__
-rw-r--r-- 1 root root 18K Feb 5 15:51 .pylintrc
-rw-r--r-- 1 syslog Unix Group\nogroup 1.2K Jan 10 22:45 README.mozilla
-rw-r--r-- 1 root root 5.0K Feb 5 15:51 README.rst
-rw-r--r-- 1 root root 437 Feb 5 15:51 .readthedocs.yml
-rw-r--r-- 1 root root 438 Feb 5 15:51 RELEASE.rst
-rw-r--r-- 1 root root 115 Feb 5 15:51 requirements_eval_tflite.txt
-rw-r--r-- 1 root root 340 Feb 5 15:51 requirements.txt
-rwxr-xr-x 1 computer computer 869 Feb 6 11:08 runNameTrieModel.sh
-rw-r--r-- 1 root root 1.2K Feb 5 15:51 stats.py
-rw-r--r-- 1 root root 1.6K Feb 5 15:51 SUPPORT.rst
drwxr-xr-x 2 root root 20K Feb 5 15:51 taskcluster
-rw-r--r-- 1 root root 2.5K Feb 5 15:51 .taskcluster.yml
-rwxr-xr-x 1 root root 7.6K Feb 5 15:51 transcribe.py
-rw-r--r-- 1 root root 326 Feb 5 15:51 .travis.yml
drwxr-xr-x 3 root root 4.0K Feb 6 10:36 util
-rw-r--r-- 1 root root 6 Feb 5 15:51 VERSION
(mark3) root@computer:/home/computer/Desktop/mark3/mfit-models# ls -hal
total 44M
drwxr-xr-x 2 root root 4.0K Feb 6 16:22 .
drwxrwxr-x 10 computer computer 4.0K Feb 6 10:36 ..
-rw-r--r-- 1 root root 329 Jan 30 10:30 alphabet.txt
-rw-r--r-- 1 root root 20M Feb 6 16:21 lm.binary
-rw-r--r-- 1 root root 1.7M Jan 31 09:24 mirrorfit.txt
-rw-r--r-- 1 root root 80 Feb 6 16:22 trie
-rw-r--r-- 1 root root 23M Feb 6 16:18 words.arpa
It's not like I told you to look at that script…