Is libctc_decoder_with_kenlm needed with version 0.4.1-0?

The client.py you posted above does have some code to handle resampling, yet the log you posted does not show the sample rate conversion warning. Did you remove the sample rate conversion code?

That test wav is 8000 Hz, and the training material is 8000 Hz. I have played around with that client's resampling code … I have tried keeping the upsampling from 8000 Hz to 16000 Hz, and I have tried staying at 8000 Hz (so I either skip the resampling part or let the code convert from 8000 to 16000) … I do get different results depending on that, but still the same number of words, and not even close to the result I am after (the test-phase result: a long sentence, not just a few words) …

Don’t. If you’re testing reproducibility, just convert everything once and keep it converted on disk, do all the conversions with the same tool and the same parameters, then pass the same file to all the different clients, and make sure no automatic resampling is happening.
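For example (a sketch only; the directory names are placeholders, and the sox flags mirror the ones in the client.py posted later in this thread, just writing .wav files at the training rate instead of raw 16 kHz to stdout):

```python
# Sketch: convert every wav once, on disk, with one tool and one set of
# parameters, so every client reads the exact same file afterwards.
# Assumes sox is installed; 'wavs_raw' and 'wavs_8k' are placeholders.
import subprocess
from pathlib import Path

TARGET_RATE = 8000  # same rate as the training material

dst_dir = Path('wavs_8k')
dst_dir.mkdir(exist_ok=True)

for wav in Path('wavs_raw').glob('*.wav'):
    out = dst_dir / wav.name
    # 16-bit mono at a fixed rate; identical parameters for every file
    subprocess.check_call([
        'sox', str(wav),
        '--bits', '16', '--channels', '1', '--rate', str(TARGET_RATE),
        str(out),
    ])
```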

Test wav is 8000 Hz. Training material is 8000 Hz … in the Python client.py I can let it upsample to 16000 Hz or skip that part of the code and leave it at 8000 Hz … the two options give slightly different results, but only a few words either way, not that long sentence I am after …

Oh, wait, if the training material is 8000 Hz you should definitely not be upsampling, but that requires modifying the client to pass the native (8000 Hz) sample rate to the API. So it’s expected that you’ll get different results with resampling.
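Concretely, the change is just this (a sketch; the file paths are placeholders, and the API calls match the 0.4.1-era client.py posted later in this thread):

```python
# Read the wav at its native rate and pass that rate to stt(),
# instead of resampling to 16000 Hz first.
import wave
import numpy as np
from deepspeech import Model

# 0.4.1-style constructor: model, n_features, n_context, alphabet, beam width
ds = Model('output_graph.pb', 26, 9, 'alphabet.txt', 1024)

fin = wave.open('audio_8k.wav', 'rb')
fs = fin.getframerate()  # 8000 here, not a hardcoded 16000
audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)
fin.close()

print(ds.stt(audio, fs))  # the native rate goes straight to the API
```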

I have just commented out that sox line without changing parameters …

Yeah, but you are adding more variables to the problem when we need fewer. Honestly, at this point in the thread, it’s completely impossible to get a clear picture of what you train and how, and what you run and how.

You have done some training, but with a loss that high I think your model is just a good random number generator.

Well, I have kept that test wav at 8000 Hz. I did try once converting it to 16000 Hz to see whether that could explain the results, but it didn’t, so 8000 Hz it is.

The Python client.py gives the same result every time. The deepspeech binary also gives the same result every time, but it differs from client.py’s.

Could this be a GPU vs. CPU problem? Training happens on the GPU, but prediction happens on the CPU when you try your new model … It shouldn’t matter, but I’m just shooting at everything.

No, the problem is the resampling. Don’t use the deepspeech binary with a model trained on 8kHz data. It won’t work.

OK, I won’t use the deepspeech binary. So we have this Python client.py … is that version dependent, or only the C-coded parts it uses to decode …

Sorry, what’s the question here?

The question is: that Python-coded client.py should work not just with DeepSpeech 0.4.1 but with other DS versions as well?

client.py is also the filename used by the deepspeech Python package, so that’s a bit confusing, but I’m going to assume you are referring to your own code.

Yes, as long as the API is compatible. You need a model that is also compatible with the DeepSpeech version you want to play with.
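If in doubt, a quick sanity check (a sketch, using only the `printVersions` helper that the client.py posted below already imports) is to print the installed DeepSpeech version and compare it with the release the model was trained and exported on:

```python
# Print the installed DeepSpeech library version so you can confirm it
# matches the release your model was exported with (e.g. 0.4.1).
from deepspeech import printVersions

printVersions()
```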

Let’s try one more time, @lissyx @kdavis, shall we. One more question. Like I said, when I run my edited evaluate.py code I get good enough results. Below is a sample, with the same wav I am using everywhere, but this time getting the results I expect:

```
python3 mnz_evaluate.py
2019-05-30 17:49:50.974290: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-05-30 17:49:51.109638: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-05-30 17:49:51.109996: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635
pciBusID: 0000:01:00.0
totalMemory: 10.73GiB freeMemory: 10.32GiB
2019-05-30 17:49:51.110008: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-05-30 17:49:51.313417: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-30 17:49:51.313443: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-05-30 17:49:51.313447: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-05-30 17:49:51.313575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 9959 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
Preprocessing ['test_m.csv']
Preprocessing done
2019-05-30 17:49:53.046422: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-05-30 17:49:53.046480: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-30 17:49:53.046485: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-05-30 17:49:53.046488: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-05-30 17:49:53.046625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9959 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
Computing acoustic model predictions…
100% (1 of 1) |####################| Elapsed Time: 0:00:00 Time: 0:00:00
Decoding predictions…
100% (1 of 1) |####################| Elapsed Time: 0:00:00 Time: 0:00:00
Test - WER: 0.846154, CER: 90.000000, loss: 287.961700

WER: 0.846154, CER: 90.000000, loss: 287.961700
- src: "no niin tarviis viela perua nii tana iltana kymmeneen mennessa ooksa muuten missa vaiheessa kuullut tost meidan autotarkastus kampanjasta joka on nyt meneillaan satanelkytyhdeksan euroa tarkastus"
- res: "niin jos kavis sielta taa hintaan peruutusmaksu mutta missa lasku tai autotarkastus kampanja elanyt menee janne yhdeksan euron tarkastus "
```
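For reference, WER is the standard word-level edit distance divided by the number of reference words, which is consistent with the report above (0.846154 = 22 word edits over the 26 words of src). A minimal sketch of that computation, not evaluate.py’s actual implementation:

```python
# Word error rate as word-level Levenshtein distance / reference length.
def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(r)][len(h)] / len(r)
```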

So, I can run that code independently, and by changing the test_m.csv content I can point it at any wav and get speech-to-text … It gives me good enough results even though the validation loss is high.

The question is: what on earth does that evaluate.py do differently than client.py? evaluate.py uses checkpoints, not the exported model (right?) … The language model is the same, the trie is the same, the alphabet is the same.

One last shot :slight_smile:

We have not yet been able to see your client.py. With 0.4.1, if you rely on libdeepspeech.so, it’s not impossible that you also have to rebuild it to change the sample rate? cc @reuben, because I don’t remember.

Here is my client.py (I have commented out the resample part and hardcoded the command-line arguments …)

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function

import argparse
import numpy as np
import shlex
import subprocess
import sys
import wave

from deepspeech import Model, printVersions
from timeit import default_timer as timer

try:
    from shhlex import quote  # typo inherited from the upstream client; always falls back
except ImportError:
    from pipes import quote

# These constants control the beam search decoder

# Beam width used in the CTC decoder when building candidate transcriptions
BEAM_WIDTH = 1024

LM_ALPHA = 0.75
LM_BETA = 1.85

N_FEATURES = 26
N_CONTEXT = 9


def convert_samplerate(audio_path):
    sox_cmd = 'sox {} --type raw --bits 16 --channels 1 --rate 16000 --encoding signed-integer --endian little --compression 0.0 --no-dither - '.format(quote(audio_path))
    try:
        output = subprocess.check_output(shlex.split(sox_cmd), stderr=subprocess.PIPE)
    except subprocess.CalledProcessError as e:
        raise RuntimeError('SoX returned non-zero status: {}'.format(e.stderr))
    except OSError as e:
        raise OSError(e.errno, 'SoX not found, use 16kHz files or install it: {}'.format(e.strerror))
    return 16000, np.frombuffer(output, np.int16)


class VersionAction(argparse.Action):
    def __init__(self, *args, **kwargs):
        super(VersionAction, self).__init__(nargs=0, *args, **kwargs)

    def __call__(self, *args, **kwargs):
        # Body as in the upstream client; unused here since the arguments are hardcoded.
        printVersions()
        exit(0)


def main():
    #print('Loading model from file {}'.format('/home/petri/kur/model/output_graph.pbl'), file=sys.stderr)
    model_load_start = timer()
    #ds = Model(args.model, N_FEATURES, N_CONTEXT, args.alphabet, BEAM_WIDTH)
    ds = Model('ac_models/output_graph.pb', N_FEATURES, N_CONTEXT, 'alphabet/alphabet.txt', BEAM_WIDTH)
    model_load_end = timer() - model_load_start
    print('Loaded model in {:.3}s.'.format(model_load_end), file=sys.stderr)

    #print('Loading language model from files {} {}'.format('m_zero_and_one_stuff_bigram.bin', '/home/petri/DeepSpeech/tier/m_only_one_and_zero.tier'), file=sys.stderr)
    lm_load_start = timer()
    ds.enableDecoderWithLM('alphabet/alphabet.txt', 'LM_models/m_zero_and_one_stuff_bigram.bin', 'tier/TRIE_2905', LM_ALPHA, LM_BETA)
    lm_load_end = timer() - lm_load_start
    print('Loaded language model in {:.3}s.'.format(lm_load_end), file=sys.stderr)

    fin = wave.open('mchunk-28.wav', 'rb')
    fs = fin.getframerate()
    audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)
    audio_length = fin.getnframes() * (1/8000)
    fin.close()

    print('Running inference.', file=sys.stderr)
    inference_start = timer()
    print(ds.stt(audio, fs))
    inference_end = timer() - inference_start
    print('Inference took %0.3fs for %0.3fs audio file.' % (inference_end, audio_length), file=sys.stderr)


if __name__ == '__main__':
    main()
```

So that’s what gives those different predictions. evaluate.py gives me the results I want. Something is different, and I just don’t see it … Everything is 8000 Hz.

Isn’t that the source code for the deepspeech binary, which has a 16000 Hz default rate like @kdavis mentioned, and the reason not to use it? But does that also affect client.py inference? Is that Python client.py still calling the part of the code you copy-pasted?

It pulls in libdeepspeech.so, which contains the aforementioned code. With 0.5 we have more flexibility, but before that, changing the rate would require changing the code and rebuilding.