libctc_decoder_with_kenlm needed with version 0.4.1-0

I have just commented out that sox line without changing the parameters …

Yeah, but you are adding more variables to the problem when we need fewer. Honestly, at this point in the thread, it’s completely impossible to get a clear picture of what you train and how, and what you run and how.

You have some training, but with a loss so high that I think your model is just a good random number generator.

Well, I have kept that test WAV at 8000 Hz. I did try once converting it to 16000 Hz to see whether that could explain the results, but it didn’t, so 8000 Hz it is.

Python’s client.py gives the same result every time. The deepspeech binary also gives the same result every time, but it differs from client.py’s.

Could this be a GPU vs. CPU problem? Training happens on the GPU, but predictions happen on the CPU when you try your new model … It shouldn’t matter, but I’m just throwing out everything.

No, the problem is the resampling. Don’t use the deepspeech binary with a model trained on 8kHz data. It won’t work.
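
As a quick sanity check, here is a small sketch (using only the standard-library wave module) to confirm a file really is 8 kHz before feeding it to a model trained on 8 kHz data; the file name is taken from the client.py posted later in the thread:

```python
# Sketch: verify that a WAV file's sample rate matches the rate
# the model was trained on (8000 Hz in this thread).
import wave

def check_sample_rate(wav_path, expected_rate=8000):
    with wave.open(wav_path, 'rb') as fin:
        rate = fin.getframerate()
    if rate != expected_rate:
        raise ValueError('{} is {} Hz, expected {} Hz'.format(wav_path, rate, expected_rate))
    return rate

check_sample_rate('mchunk-28.wav')
```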

Ok, I won’t use the deepspeech binary. So we have this Python client.py … is that version dependent, or just the C-coded parts it uses to decode …

Sorry, what’s the question here?

The question is: that Python-coded client.py should work not just with DeepSpeech 0.4.1 but with other DS versions as well?

client.py is also the filename used by the deepspeech python binary, so that’s a bit confusing, but I’m going to assume you are referring to your code.

Yes, as long as the API is compatible. You need a model that is also compatible with the DeepSpeech version you want to play with.
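
If it helps, a minimal sketch of how to see which DeepSpeech build the Python package actually loads (printVersions is the same helper the client.py posted below imports):

```python
# Print the version information of the native DeepSpeech library
# that the installed Python package wraps.
from deepspeech import printVersions

printVersions()
```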

Let’s try one more time, @lissyx @kdavis, shall we. One more question. Like I said, with my edited evaluate.py code I get results that are good enough. Below is a sample, using the same WAV I am using everywhere, but this time getting the results I expect:

python3 mnz_evaluate.py

2019-05-30 17:49:50.974290: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA

2019-05-30 17:49:51.109638: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

2019-05-30 17:49:51.109996: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:

name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635

pciBusID: 0000:01:00.0

totalMemory: 10.73GiB freeMemory: 10.32GiB

2019-05-30 17:49:51.110008: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0

2019-05-30 17:49:51.313417: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:

2019-05-30 17:49:51.313443: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0

2019-05-30 17:49:51.313447: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N

2019-05-30 17:49:51.313575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 9959 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)

Preprocessing ['test_m.csv']

Preprocessing done

2019-05-30 17:49:53.046422: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0

2019-05-30 17:49:53.046480: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:

2019-05-30 17:49:53.046485: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0

2019-05-30 17:49:53.046488: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N

2019-05-30 17:49:53.046625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9959 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)

Computing acoustic model predictions…

100% (1 of 1) |########################################################################################################################################################################| Elapsed Time: 0:00:00 Time: 0:00:00

Decoding predictions…

100% (1 of 1) |########################################################################################################################################################################| Elapsed Time: 0:00:00 Time: 0:00:00

Test - WER: 0.846154, CER: 90.000000, loss: 287.961700


WER: 0.846154, CER: 90.000000, loss: 287.961700

  • src: “no niin tarviis viela perua nii tana iltana kymmeneen mennessa ooksa muuten missa vaiheessa kuullut tost meidan autotarkastus kampanjasta joka on nyt meneillaan satanelkytyhdeksan euroa tarkastus”

  • res: "niin jos kavis sielta taa hintaan peruutusmaksu mutta missa lasku tai autotarkastus kampanja elanyt menee janne yhdeksan euron tarkastus "

So, I can run that code independently, and by changing the test_m.csv content I can do that with any WAV and get speech to text … It gives me good enough results even though the validation loss is high.
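
For reference, a minimal sketch of how such a one-row test_m.csv can be generated (wav_filename, wav_filesize and transcript are the columns DeepSpeech’s CSV loader expects; the file name and transcript here are placeholders):

```python
# Sketch: build a one-row test CSV pointing evaluate.py at an arbitrary WAV.
import csv
import os

wav_path = 'mchunk-28.wav'               # any 8 kHz test file
transcript = 'expected transcript here'  # placeholder reference text

with open('test_m.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['wav_filename', 'wav_filesize', 'transcript'])
    writer.writerow([wav_path, os.path.getsize(wav_path), transcript])
```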

The question is: what on earth does evaluate.py do differently from client.py? evaluate.py uses checkpoints, not the exported model (right?) … The language model is the same, the trie is the same, the alphabet is the same.

One last shot :slight_smile:

We have not yet been able to see your client.py. With 0.4.1, if you rely on libdeepspeech.so, it’s not impossible that you also have to rebuild it to change the sample rate? cc @reuben because I don’t remember.

Here is my client.py (I have commented out the resample part and hardcoded the command-line arguments … )

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function

import argparse
import numpy as np
import shlex
import subprocess
import sys
import wave

from deepspeech import Model, printVersions
from timeit import default_timer as timer

try:
    from shhlex import quote
except ImportError:
    from pipes import quote

# These constants control the beam search decoder

# Beam width used in the CTC decoder when building candidate transcriptions
BEAM_WIDTH = 1024

# Language model weight (alpha) and word insertion bonus (beta)
LM_ALPHA = 0.75
LM_BETA = 1.85

# Number of MFCC features and size of the context window used by the model
N_FEATURES = 26
N_CONTEXT = 9


def convert_samplerate(audio_path):
    # Note: this helper resamples everything to 16 kHz, which is wrong for a
    # model trained on 8 kHz data -- it is no longer called in main() below.
    sox_cmd = 'sox {} --type raw --bits 16 --channels 1 --rate 16000 --encoding signed-integer --endian little --compression 0.0 --no-dither - '.format(quote(audio_path))
    try:
        output = subprocess.check_output(shlex.split(sox_cmd), stderr=subprocess.PIPE)
    except subprocess.CalledProcessError as e:
        raise RuntimeError('SoX returned non-zero status: {}'.format(e.stderr))
    except OSError as e:
        raise OSError(e.errno, 'SoX not found, use 16kHz files or install it: {}'.format(e.strerror))

    return 16000, np.frombuffer(output, np.int16)


class VersionAction(argparse.Action):
    def __init__(self, *args, **kwargs):
        super(VersionAction, self).__init__(nargs=0, *args, **kwargs)

    def __call__(self, *args, **kwargs):
        printVersions()
        exit(0)


def main():
    #print('Loading model from file {}'.format('/home/petri/kur/model/output_graph.pbl'), file=sys.stderr)
    model_load_start = timer()
    #ds = Model(args.model, N_FEATURES, N_CONTEXT, args.alphabet, BEAM_WIDTH)
    ds = Model('ac_models/output_graph.pb', N_FEATURES, N_CONTEXT, 'alphabet/alphabet.txt', BEAM_WIDTH)
    model_load_end = timer() - model_load_start
    print('Loaded model in {:.3}s.'.format(model_load_end), file=sys.stderr)

    #print('Loading language model from files {} {}'.format('m_zero_and_one_stuff_bigram.bin', '/home/petri/DeepSpeech/tier/m_only_one_and_zero.tier'), file=sys.stderr)
    lm_load_start = timer()
    ds.enableDecoderWithLM('alphabet/alphabet.txt', 'LM_models/m_zero_and_one_stuff_bigram.bin', 'tier/TRIE_2905', LM_ALPHA, LM_BETA)
    lm_load_end = timer() - lm_load_start
    print('Loaded language model in {:.3}s.'.format(lm_load_end), file=sys.stderr)

    fin = wave.open('mchunk-28.wav', 'rb')
    fs = fin.getframerate()  # 8000 for this file; no resampling is done
    audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)
    audio_length = fin.getnframes() * (1/8000)
    fin.close()

    print('Running inference.', file=sys.stderr)
    inference_start = timer()
    print(ds.stt(audio, fs))
    inference_end = timer() - inference_start
    print('Inference took %0.3fs for %0.3fs audio file.' % (inference_end, audio_length), file=sys.stderr)


if __name__ == '__main__':
    main()

So that’s what gives those different predictions. evaluate.py gives me the results I want. Something is different, and I just don’t see it … Everything is 8000 Hz.

Isn’t that the source code for the deepspeech binary, which has a 16000 Hz default rate, the one kdavis mentioned not to use for that reason? But does that also affect client.py inference? Is that Python client.py still calling the part of the code you copy-pasted?

This pulls in libdeepspeech.so, which has the aforementioned code. With 0.5 we have more flexibility, but before that, changing the rate would require changing the code and rebuilding.


So, that could be the reason why my client.py is giving poor predictions. And because evaluate.py uses the last checkpoint, and not the exported model, that DEFAULT_SAMPLE_RATE=16000 doesn’t affect it at that point … So, what is your advice at this point?

Either go down the road of rebuilding, or just re-train on 0.5.0 and use pre-built binaries. It should work out of the box, and if it does not, it gives us useful and actionable feedback to fix it and improve.

Ok, so either use this version, 0.5.0-alpha.10, or rebuild …

No, you don’t have to rebuild, just pass the value 8000 to the sample_rate parameter in the API. The problem with using the deepspeech binary is that it resamples to 16000, which you don’t want, since your model was trained on 8kHz data.
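
For reference, a minimal sketch of that end to end, based on the client.py posted above (assuming the 0.4.1/0.5.0-style Python API, where stt() takes the sample rate as its second argument; model and file paths are the ones from the thread):

```python
# Sketch: 8 kHz inference with no resampling step anywhere.
import wave
import numpy as np
from deepspeech import Model

ds = Model('ac_models/output_graph.pb', 26, 9, 'alphabet/alphabet.txt', 1024)
ds.enableDecoderWithLM('alphabet/alphabet.txt',
                       'LM_models/m_zero_and_one_stuff_bigram.bin',
                       'tier/TRIE_2905', 0.75, 1.85)

with wave.open('mchunk-28.wav', 'rb') as fin:
    fs = fin.getframerate()
    audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)

assert fs == 8000, 'expected an 8 kHz file, got {} Hz'.format(fs)
print(ds.stt(audio, fs))  # sample rate passed explicitly; no conversion
```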

Okay, it would be nice not to have to rebuild it again. In Python’s client.py, do I pass that parameter 8000 like print(ds.stt(audio, 8000)), or where? If that is the right place, it didn’t help …