Is libctc_decoder_with_kenlm needed with version 0.4.1-0?

Hello,

My versions are:

TensorFlow: v1.12.0-10-ge232881
DeepSpeech: v0.4.1-0-g0e40db6

I have trained my own model, but I am getting confusing results:
Evaluating on one test file (when the last epoch is finished) I get decent results, or good enough anyway, but when I do the same inference using the Python native_client client.py, instead of a sentence I get only one or two words, or empty predictions.

So my questions are:
Do I need the --decoder_library_path libctc_decoder_with_kenlm.so flag to be set in the training phase (I didn't see that flag in the help menu …),
or do I need that libctc_decoder_with_kenlm.so when I am using the Python native client client.py?

I searched topics on this subject and read somewhere that poor results from the Python native client's inference are caused by a missing ctc_decoder … That was about older versions I think, but I lost track, so …

Thanks in advance!

EDIT:

When I do:
pip3 install $(python3 util/taskcluster.py --decoder)

I get:

Requirement already satisfied: ds-ctcdecoder==0.5.0a4 from https://index.taskcluster.net/v1/task/project.deepspeech.deepspeech.native_client.v0.5.0-alpha.4.cpu-ctc/artifacts/public/ds_ctcdecoder-0.5.0a4-cp36-cp36m-manylinux1_x86_64.whl in /home/petri/env/lib/python3.6/site-packages

Requirement already satisfied: numpy>=1.7.0 in /home/petri/env/lib/python3.6/site-packages (from ds-ctcdecoder==0.5.0a4)
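
To double-check what actually ends up in the virtualenv, I can list the installed packages and print the versions via the same printVersions helper that client.py imports (assuming that helper behaves the same here):

pip3 show ds-ctcdecoder deepspeech
python3 -c "from deepspeech import printVersions; printVersions()"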

You didn’t see the flag in the help menu because it does not exist. v0.4.1 doesn’t use libctc_decoder_with_kenlm.so.

Can you provide more information about how you’re using the Python client?

Thanks for your reply.

Well, I have “hard coded” the needed arguments directly into the code just for test purposes (using the same .bin and .tier as in the training phase) … The code below gives very poor results, nothing compared to the evaluate phase of training.

I should mention that my training material is 8000 Hz, so I have changed that part in client.py so that my .wavs don't get upsampled to 16 kHz … Could this be some ctc_decoder issue, version mismatch etc.? No errors are given when I run inference … everything looks fine, except the results are junk …

def main():
  #  parser = argparse.ArgumentParser(description='Running DeepSpeech inference.')
  #  parser.add_argument('--model', required=True,
  #                      help='Path to the model (protocol buffer binary file)')
  #  parser.add_argument('--alphabet', required=True,
  #                      help='Path to the configuration file specifying the alphabet used by the network')
  #  parser.add_argument('--lm', nargs='?',
  #                      help='Path to the language model binary file')
  #  parser.add_argument('--trie', nargs='?',
  #                      help='Path to the language model trie file created with native_client/generate_trie')
  #  parser.add_argument('--audio', required=True,
  #                      help='Path to the audio file to run (WAV format)')
  #  parser.add_argument('--version', action=VersionAction,
  #                      help='Print version and exits')
  #  args = parser.parse_args()
 

    print('Loading model from file {}'.format('/home/petri/kur/model/output_graph.pbl'), file=sys.stderr)
    model_load_start = timer()
    #ds = Model(args.model, N_FEATURES, N_CONTEXT, args.alphabet, BEAM_WIDTH)
    ds = Model('/home/petri/kur/model/output_graph.pb', N_FEATURES, N_CONTEXT, '/home/petri/DeepSpeech/alphabet.txt', BEAM_WIDTH)
    model_load_end = timer() - model_load_start
    print('Loaded model in {:.3}s.'.format(model_load_end), file=sys.stderr)

    
    print('Loading language model from files {} {}'.format('/home/petri/DeepSpeech/mt_with_d_chats.bin', '/home/petri/DeepSpeech/tier/m_kw_calls_only.tier'), file=sys.stderr)
    lm_load_start = timer()
    ds.enableDecoderWithLM('/home/petri/DeepSpeech/alphabet.txt', '/home/petri/DeepSpeech/mt_with_d_chats.bin', '/home/petri/DeepSpeech/tier/m_kw_calls_only.tier', LM_ALPHA, LM_BETA)
    lm_load_end = timer() - lm_load_start
    print('Loaded language model in {:.3}s.'.format(lm_load_end), file=sys.stderr)
    
    fin = wave.open('/home/petri/Downloads/audiostorage/jv@M_fi.wav', 'rb')
    fs = fin.getframerate()
    if fs != 8000:
        print('Warning: original sample rate ({}) is different than 16kHz. Resampling might produce erratic speech recognition.'.format(fs), file=sys.stderr)
        fs, audio = convert_samplerate('/home/petri/Downloads/audiostorage/jv@M_fi.wav')
    else:
        audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)

    audio_length = fin.getnframes() * (1/8000)
    fin.close()

    print('Running inference.', file=sys.stderr)
    inference_start = timer()
    print(ds.stt(audio, fs))
    inference_end = timer() - inference_start
    print('Inference took %0.3fs for %0.3fs audio file.' % (inference_end, audio_length), file=sys.stderr)

if __name__ == '__main__':
    main()

EDIT:

I have trained the model using an RTX GPU, but this inference is running on CPU …

If you’re passing the same LM binary and trie files, the same LM hyperparameters, and the same audio as used in the evaluation epoch after training, the only remaining variable is the beam width, which is 1024 in the training code but 500 in the clients. If you set it to 1024 in the client, does it help?

Also, did you check with that specific file in the evaluation phase? For example by creating a simple test CSV that only has one line for /home/petri/Downloads/audiostorage/jv@M_fi.wav in it, and then using that as the test CSV.
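
For reference, the test CSV uses the same columns as your training CSVs, so a one-line file along these lines should do (wav_filesize is the size of the file in bytes, and the transcript is whatever you expect the audio to contain):

wav_filename,wav_filesize,transcript
/home/petri/Downloads/audiostorage/jv@M_fi.wav,&lt;size in bytes&gt;,&lt;expected transcript&gt;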

Thanks for your reply! I think the beam width is only 500, so I will update that to 1024. By LM hyperparameters you mean the alpha value etc.? I will make a test CSV, try the output and post it here. I will use DeepSpeech.py with epoch 1, so it will give that “evaluate” output, which I think is “the best”. Secondly I will use the Python native client client.py with the same file … so we can compare.

But this ds-ctcdecoder==0.5.0a4 library/file is used automatically, so I don't have to check that it's working properly or anything else?

The ds_ctcdecoder module is automatically used, but you need to make sure you’re using the same version as your version of DeepSpeech. v0.5.0a4 is newer than v0.4.1 but IIRC it is compatible. Try downgrading to v0.4.1 and see if that changes things.
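
Something along these lines should fetch the matching decoder package (assuming the v0.4.1 decoder artifact is still published under that branch name, which I haven't verified):

pip3 uninstall ds-ctcdecoder
pip3 install $(python3 util/taskcluster.py --decoder --branch v0.4.1)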

Hello, here are the results from the Python client.py and from evaluate after the last epoch of training:

Client.py
Loading model from file /home/petri/kur/model/output_graph.pbl
TensorFlow: v1.12.0-10-ge232881
DeepSpeech: v0.4.1-0-g0e40db6
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2019-05-18 13:15:04.170894: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Loaded model in 0.00895s.
Loading language model from files m_zero_and_one_stuff_bigram.bin /home/petri/DeepSpeech/tier/m_only_one_and_zero.tier
Loaded language model in 0.000838s.
Running inference.
saako sen 
Inference took 0.182s for 11.730s audio file.


RESULTS from Evaluate after last epoch: 



100% (1 of 1) |###############################################################################################################################################################################| Elapsed Time: 0:00:00 Time:  0:00:00
Decoding predictions...
100% (1 of 1) |###############################################################################################################################################################################| Elapsed Time: 0:00:00 Time:  0:00:00
Test - WER: 0.846154, CER: 0.430052, loss: 314.121490
--------------------------------------------------------------------------------
WER: 0.846154, CER: 83.000000, loss: 314.121490
 - src: "no niin tarviis viela perua nii tana iltana kymmeneen mennessa ooksa muuten missa vaiheessa kuullut tost meidan autonhuolto kampanjasta joka on nyt meneillaan sataseitseman euroa tarkastus"
 - res: "niin jos nyt tarvii niin taa hinta ne ruut siina on mutta missa vaiheessa kun niin autonhuolto kampanja seka nyt menee ihan et eka euroa tarkastus "
--------------------------------------------------------------------------------
Exporting..
I Exporting the model...
WARNING:tensorflow:From /home/petri/env/lib/python3.6/site-packages/tensorflow/python/tools/freeze_graph.py:232: convert_variables_to_constants (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.compat.v1.graph_util.convert_variables_to_constants
WARNING:tensorflow:From /home/petri/env/lib/python3.6/site-packages/tensorflow/python/framework/graph_util_impl.py:245: extract_sub_graph (from tensorflow.python.framework.graph_util_impl) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.compat.v1.graph_util.extract_sub_graph
I Models exported at /home/petri/kur/model/

So, same wav file, but a huge difference. I tried to downgrade the ds_ctcdecoder by giving the --branch flag, but no luck … only 0.5 (which I am using here in these examples) is found …

LM parameters are:
BEAM_WIDTH = 1024
LM_ALPHA = 0.75
LM_BETA = 1.85

and the audio feature constants are:

N_FEATURES = 26
N_CONTEXT = 9

Any ideas where that huge difference might come from?

Can you share the full command lines you used for training, evaluating and exporting the model, as well as the full command lines you used for inference with the client?

Hello, I deleted my DeepSpeech folder (which was that 0.5-something version I mentioned) and downloaded DeepSpeech v0.4.1-0-g0e40db6 and the prebuilt binaries. I remade the TIER file and trained my model again. Could this generate_tier be version dependent …?
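
For reference, I regenerated it with the prebuilt generate_trie binary, roughly like this (typing from memory, so the exact arguments may be slightly off):

./generate_trie alphabet.txt LM_models/meh_zero_and_one_stuff_bigram.bin tier/trie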

The evaluate after the last epoch is still giving the best inference. The results are now closer, however:

The deepspeech executable gives the second best inference for the same .wav.

The Python client.py (well, this client.py is from a previous installation) inference is missing some of the words that I got from evaluate …

My best guess as to what is causing this is some incompatible libraries in decoding; after all, I didn't delete the virtualenv, just deleted the deepspeech folder …

I will post what you asked for …

Thanks in advance!

This is my training command line:

#!/bin/sh
set -xe
if [ ! -f DeepSpeech.py ]; then
    echo "Please make sure you run this from DeepSpeech's top level directory."
    exit 1
fi;

python -u DeepSpeech.py \
  --train_files /home/petri/kur/data/audio/meh_and_dana_kw_and_zero_m_calls.csv \
  --dev_files /home/petri/DeepSpeech-0.4.1/dev_meh.csv \
  --test_files /home/petri/DeepSpeech-0.4.1/test_meh.csv \
  --train_batch_size 70 \
  --dev_batch_size 1 \
  --test_batch_size 1 \
  --n_hidden 375 \
  --epoch 70 \
  --validation_step 3 \
  --early_stop False \
  --earlystop_nsteps 6 \
  --estop_mean_thresh 0.2 \
  --estop_std_thresh 0.2 \
  --dropout_rate 0.22 \
  --learning_rate 0.00098 \
  --report_count 200 \
  --use_seq_length False \
  --export_dir /home/petri/DeepSpeech-0.4.1/ac_models/ \
  --checkpoint_dir /home/petri/DeepSpeech-0.4.1/m_and_dana_checkpoint/ \
  --alphabet_config_path /home/petri/DeepSpeech-0.4.1/alphabet.txt \
  --lm_binary_path /home/petri/DeepSpeech-0.4.1/LM_models/meh_zero_and_one_stuff_bigram.bin \
  --lm_trie_path /home/petri/DeepSpeech-0.4.1/tier/trie \
  "$@"

This is my deepspeech command line after the model is done and I test it with the same file:

deepspeech --model ac_models/output_graph.pb --alphabet alphabet.txt --lm LM_models/meh_zero_and_one_stuff_bigram.bin --trie tier/trie --audio /home/petri/Downloads/onlylastseven_huhtikuu/at_chunk-28.wav

and that gives these warnings before outputting the inference (just showing a few of them):

2019-05-22 17:52:54.475540: W tensorflow/contrib/rnn/kernels/lstm_ops.cc:855] BlockLSTMOp is inefficient when both batch_size and cell_size are odd. You are using: batch_size=1, cell_size=375

2019-05-22 17:52:54.477728: W tensorflow/contrib/rnn/kernels/lstm_ops.cc:850] BlockLSTMOp is inefficient when both batch_size and input_size are odd. You are using: batch_size=1, input_size=375

2019-05-22 17:52:54.477746: W tensorflow/contrib/rnn/kernels/lstm_ops.cc:855] BlockLSTMOp is inefficient when both batch_size and cell_size are odd. You are using: batch_size=1, cell_size=375

and my client.py is:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function

import argparse
import numpy as np
import shlex
import subprocess
import sys
import wave

from deepspeech import Model, printVersions
from timeit import default_timer as timer

try:
    from shhlex import quote
except ImportError:
    from pipes import quote


# These constants control the beam search decoder

# Beam width used in the CTC decoder when building candidate transcriptions
BEAM_WIDTH = 1024

# The alpha hyperparameter of the CTC decoder. Language Model weight
#LM_ALPHA = 0.0
LM_ALPHA = 0.85

# The beta hyperparameter of the CTC decoder. Word insertion bonus.
LM_BETA = 1.85
#LM_BETA = 400

# These constants are tied to the shape of the graph used (changing them changes
# the geometry of the first layer), so make sure you use the same constants that
# were used during training

# Number of MFCC features to use
N_FEATURES = 26

# Size of the context window used for producing timesteps in the input vector
N_CONTEXT = 9


def convert_samplerate(audio_path):
    sox_cmd = 'sox {} --type raw --bits 16 --channels 1 --rate 16000 --encoding signed-integer --endian little --compression 0.0 --no-dither - '.format(quote(audio_path))
    try:
        output = subprocess.check_output(shlex.split(sox_cmd), stderr=subprocess.PIPE)
    except subprocess.CalledProcessError as e:
        raise RuntimeError('SoX returned non-zero status: {}'.format(e.stderr))
    except OSError as e:
        raise OSError(e.errno, 'SoX not found, use 16kHz files or install it: {}'.format(e.strerror))

    return 16000, np.frombuffer(output, np.int16)


class VersionAction(argparse.Action):
    def __init__(self, *args, **kwargs):
        super(VersionAction, self).__init__(nargs=0, *args, **kwargs)

    def __call__(self, *args, **kwargs):
        printVersions()
        exit(0)


def main():
    #  parser = argparse.ArgumentParser(description='Running DeepSpeech inference.')
    #  parser.add_argument('--model', required=True,
    #                      help='Path to the model (protocol buffer binary file)')
    #  parser.add_argument('--alphabet', required=True,
    #                      help='Path to the configuration file specifying the alphabet used by the network')
    #  parser.add_argument('--lm', nargs='?',
    #                      help='Path to the language model binary file')
    #  parser.add_argument('--trie', nargs='?',
    #                      help='Path to the language model trie file created with native_client/generate_trie')
    #  parser.add_argument('--audio', required=True,
    #                      help='Path to the audio file to run (WAV format)')
    #  parser.add_argument('--version', action=VersionAction,
    #                      help='Print version and exits')
    #  args = parser.parse_args()

    #print('Loading model from file {}'.format('/home/petri/kur/model/output_graph.pbl'), file=sys.stderr)
    model_load_start = timer()
    #ds = Model(args.model, N_FEATURES, N_CONTEXT, args.alphabet, BEAM_WIDTH)
    ds = Model('ac_models/output_graph.pb', N_FEATURES, N_CONTEXT, 'alphabet.txt', BEAM_WIDTH)
    model_load_end = timer() - model_load_start
    print('Loaded model in {:.3}s.'.format(model_load_end), file=sys.stderr)

    lm_load_start = timer()
    ds.enableDecoderWithLM('alphabet.txt', 'LM_models/meh_zero_and_one_stuff_bigram.bin', 'tier/trie', LM_ALPHA, LM_BETA)
    lm_load_end = timer() - lm_load_start
    print('Loaded language model in {:.3}s.'.format(lm_load_end), file=sys.stderr)

    fin = wave.open('/home/petri/Downloads/onlylastsevenanne_huhtikuu/at_chunk-28.wav', 'rb')
    fs = fin.getframerate()
    if fs != 16000:
        print('Warning: original sample rate ({}) is different than 16kHz. Resampling might produce erratic speech recognition.'.format(fs), file=sys.stderr)
        fs, audio = convert_samplerate('/home/petri/Downloads/onlylastsevenanne_huhtikuu/at_chunk-28.wav')
    else:
        audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)

    audio_length = fin.getnframes() * (1/16000)
    fin.close()

    print('Running inference.', file=sys.stderr)
    inference_start = timer()
    print(ds.stt(audio, fs))
    inference_end = timer() - inference_start
    print('Inference took %0.3fs for %0.3fs audio file.' % (inference_end, audio_length), file=sys.stderr)


if __name__ == '__main__':
    main()

So, three ways to do inference, same model, same LM, same TIER, and three different inference results from the same wav …

FWIW the language model files are a binary and a trie, not tier.

I would recommend removing the --use_seq_length False parameter, although I think that syntax is not actually being picked up by the arg parser, but it wouldn't hurt to make sure. Other than that and the beam width difference between the clients and evaluate.py, the only other reason I can think of is different versions of the training code vs the clients. You mentioned you already double-checked that everything is 0.4.1, so I would check the other things I mentioned.

Do you recommend that I delete my virtualenv env folder and remake a fresh virtual env, to make sure I don't have any misbehaving libraries causing this? Or is it easier to manually check the critical files located in the native_client dir and … ?

Yes, creating a virtualenv from scratch is probably the safest approach.

OK, will do. BTW: do you know if people have made their “own decoders” by editing evaluate.py and running that piece of code directly, and maybe even shared it somewhere? That's my plan E if everything else seems to fail …

No, I haven’t seen that.

Hello, again!

I did that new virtual environment and used:
TensorFlow: v1.12.0-10-ge232881
DeepSpeech: v0.4.1-0-g0e40db6

I uninstalled the CPU version of TF and installed the GPU version.

I took away that seq_length parameter.

Evaluate gives the following result:

Decoding predictions...
100% (1 of 1) |####################################################################################################################################################################| Elapsed Time: 0:00:00 Time:  0:00:00
Test - WER: 0.923077, CER: 83.000000, loss: 266.225098
--------------------------------------------------------------------------------
WER: 0.923077, CER: 83.000000, loss: 266.225098
 - src: "no niin tarviis viela perua nii tana kymmeneen mennessa ooksa muuten missa vaiheessa kuullut tost meidan autotarkastus kampanjasta joka on nyt meneillaan satanelkytyhdeksan euroa tarkastus"
 - res: "niin jos nyt tarvii siel perua niita nain tan kymmenes peruutus viimeista kaksi onks mutta missa vaiheessa niin autotarkastus kampanja saan menee ihan ne satayhdeksan euron tarkastus "
--------------------------------------------------------------------------------
I Exporting the model...
I Models exported at ac_models/

But when I use the deepspeech executable I get nothing!

Same thing if I use client.py: I get nothing!

What is causing this? My training data is 8000 Hz, and my test data is 8000 Hz. The Python client.py upsamples the wav to 16000: no results. But when I bypassed that and kept the sample rate at 8000 Hz … no effect, same blank inference …

It's getting strange, so the solution must be simple, but I just can't see it …

EDIT:

Returning a blank, or an empty line, was because of a missing alphabet.txt. After fixing that I started to get some words, but we are back to the same problem: it doesn't give the same inference as the evaluate phase after training is over. BEAM_WIDTH is 1024 and still … over half of the words are missing.

I have modified the evaluate.py code so that I can now use it to do inference with the same accuracy as the evaluate phase after the last epoch.

However, I know that's not the right way to do it. I really would like to know why I am not able to get the same results using the deepspeech binary or the Python client code.

In that evaluate.py code I use the same LM and TRIE as in training (I use FLAGS to point to the right LM and TRIE). I use those same LM and TRIE in client.py and give them as arguments to the deepspeech binary, but I'm still getting completely different results. I have a theory why this is happening: evaluate.py uses the last(?) checkpoint, but the deepspeech binary and client.py use the exported model. Could this be the answer, and if yes, what is happening in the exporting phase?
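
For completeness, the checkpoint-based evaluation I run looks roughly like this, with a one-line test CSV for that wav (single_file_test.csv is just my name for that CSV, and I haven't double-checked which of these flags evaluate.py actually reads in 0.4.1, so treat it as a sketch):

python -u evaluate.py \
  --test_files single_file_test.csv \
  --test_batch_size 1 \
  --n_hidden 375 \
  --checkpoint_dir /home/petri/DeepSpeech-0.4.1/m_and_dana_checkpoint/ \
  --alphabet_config_path /home/petri/DeepSpeech-0.4.1/alphabet.txt \
  --lm_binary_path /home/petri/DeepSpeech-0.4.1/LM_models/meh_zero_and_one_stuff_bigram.bin \
  --lm_trie_path /home/petri/DeepSpeech-0.4.1/tier/trie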

Thanks, have to take a look at it.

Do you have some guidelines for me to check what could be the reason why I am getting different results from the same wav depending on which method I am using? (The deepspeech binary and the Python client.py vs. the evaluate phase in training, which gives me the best result after the last training epoch …)

Thanks in advance.

No, because I have still not been able to properly understand your issue … There are too many variables in play that may explain the differences.