libctc_decoder_with_kenlm needed with version 0.4.1-0

I have just commented out that sox line without changing the parameters …

Yeah, but you are adding more variables to the problem when we need fewer. Honestly, at this point in the thread, it’s completely impossible to get a clear picture of what you train and how, and what you run and how.

You have some training, but with a loss so high that I think your model is just a good random number generator.

Well, I have kept that test WAV at 8000 Hz. I did try once converting it to 16000 Hz to see whether that could explain the results, but it didn’t, so 8000 Hz it is.

Python’s client.py gives the same result every time. The deepspeech binary also gives the same result every time, but it differs from client.py’s.

Could this be a GPU vs. CPU problem? Training happens on the GPU, but predictions happen on the CPU when you try your new model … It shouldn’t matter, but I’m just throwing out everything.

No, the problem is the resampling. Don’t use the deepspeech binary with a model trained on 8kHz data. It won’t work.
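
As a quick sanity check, here is a small sketch (using only the standard-library wave module) to confirm a file really is 8 kHz before feeding it to a model trained on 8 kHz data; the file name is taken from the client.py posted later in the thread:

```python
# Sketch: verify that a WAV file's sample rate matches the rate
# the model was trained on (8000 Hz in this thread).
import wave

def check_sample_rate(wav_path, expected_rate=8000):
    with wave.open(wav_path, 'rb') as fin:
        rate = fin.getframerate()
    if rate != expected_rate:
        raise ValueError('{} is {} Hz, expected {} Hz'.format(wav_path, rate, expected_rate))
    return rate

check_sample_rate('mchunk-28.wav')
```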

Ok, I won’t use the deepspeech binary. So we have this Python client.py … is that version dependent, or just the C-coded parts it uses to decode …

Sorry, what’s the question here?

The question is: that Python-coded client.py should work not just with DeepSpeech 0.4.1 but with other DS versions as well?

client.py is also the filename used by the deepspeech python binary, so that’s a bit confusing, but I’m going to assume you are referring to your code.

Yes, as long as the API is compatible. You need a model that is also compatible with the DeepSpeech version you want to play with.
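
If it helps, a minimal sketch of how to see which DeepSpeech build the Python package actually loads (printVersions is the same helper the client.py posted below imports):

```python
# Print the version information of the native DeepSpeech library
# that the installed Python package wraps.
from deepspeech import printVersions

printVersions()
```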

Let’s try one more time, @lissyx @kdavis, shall we. One more question. Like I said, with my edited evaluate.py code I get results that are good enough. Below is a sample, using the same WAV I am using everywhere, but this time getting the results I expect:

python3 mnz_evaluate.py

2019-05-30 17:49:50.974290: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA

2019-05-30 17:49:51.109638: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero

2019-05-30 17:49:51.109996: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:

name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.635

pciBusID: 0000:01:00.0

totalMemory: 10.73GiB freeMemory: 10.32GiB

2019-05-30 17:49:51.110008: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0

2019-05-30 17:49:51.313417: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:

2019-05-30 17:49:51.313443: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0

2019-05-30 17:49:51.313447: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N

2019-05-30 17:49:51.313575: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 9959 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)

Preprocessing ['test_m.csv']

Preprocessing done

2019-05-30 17:49:53.046422: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0

2019-05-30 17:49:53.046480: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:

2019-05-30 17:49:53.046485: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0

2019-05-30 17:49:53.046488: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N

2019-05-30 17:49:53.046625: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9959 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)

Computing acoustic model predictions…

100% (1 of 1) |########################################################################################################################################################################| Elapsed Time: 0:00:00 Time: 0:00:00

Decoding predictions…

100% (1 of 1) |########################################################################################################################################################################| Elapsed Time: 0:00:00 Time: 0:00:00

Test - WER: 0.846154, CER: 90.000000, loss: 287.961700


WER: 0.846154, CER: 90.000000, loss: 287.961700

  • src: “no niin tarviis viela perua nii tana iltana kymmeneen mennessa ooksa muuten missa vaiheessa kuullut tost meidan autotarkastus kampanjasta joka on nyt meneillaan satanelkytyhdeksan euroa tarkastus”

  • res: "niin jos kavis sielta taa hintaan peruutusmaksu mutta missa lasku tai autotarkastus kampanja elanyt menee janne yhdeksan euron tarkastus "

So, I can run that code independently, and by changing the test_m.csv content I can do that with any WAV and get speech to text … It gives me good enough results even though the validation loss is high.
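
For reference, a minimal sketch of how such a one-row test_m.csv can be generated (wav_filename, wav_filesize and transcript are the columns DeepSpeech’s CSV loader expects; the file name and transcript here are placeholders):

```python
# Sketch: build a one-row test CSV pointing evaluate.py at an arbitrary WAV.
import csv
import os

wav_path = 'mchunk-28.wav'               # any 8 kHz test file
transcript = 'expected transcript here'  # placeholder reference text

with open('test_m.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['wav_filename', 'wav_filesize', 'transcript'])
    writer.writerow([wav_path, os.path.getsize(wav_path), transcript])
```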

The question is: what on earth does evaluate.py do differently from client.py? evaluate.py uses checkpoints, not the exported model (right?) … The language model is the same, the trie is the same, the alphabet is the same.

One last shot :slight_smile:

We have not yet been able to see your client.py. With 0.4.1, if you rely on libdeepspeech.so, it’s not impossible that you also have to rebuild it to change the sample rate? cc @reuben because I don’t remember.

Here is my client.py (I have commented out the resample part and hardcoded the command-line arguments … )

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function

import argparse
import numpy as np
import shlex
import subprocess
import sys
import wave

from deepspeech import Model, printVersions
from timeit import default_timer as timer

try:
    from shhlex import quote
except ImportError:
    from pipes import quote

# These constants control the beam search decoder

# Beam width used in the CTC decoder when building candidate transcriptions
BEAM_WIDTH = 1024

# Language model weight (alpha) and word insertion bonus (beta)
LM_ALPHA = 0.75
LM_BETA = 1.85

# Number of MFCC features and size of the context window used by the model
N_FEATURES = 26
N_CONTEXT = 9


def convert_samplerate(audio_path):
    # Note: this helper resamples everything to 16 kHz, which is wrong for a
    # model trained on 8 kHz data -- it is no longer called in main() below.
    sox_cmd = 'sox {} --type raw --bits 16 --channels 1 --rate 16000 --encoding signed-integer --endian little --compression 0.0 --no-dither - '.format(quote(audio_path))
    try:
        output = subprocess.check_output(shlex.split(sox_cmd), stderr=subprocess.PIPE)
    except subprocess.CalledProcessError as e:
        raise RuntimeError('SoX returned non-zero status: {}'.format(e.stderr))
    except OSError as e:
        raise OSError(e.errno, 'SoX not found, use 16kHz files or install it: {}'.format(e.strerror))

    return 16000, np.frombuffer(output, np.int16)


class VersionAction(argparse.Action):
    def __init__(self, *args, **kwargs):
        super(VersionAction, self).__init__(nargs=0, *args, **kwargs)

    def __call__(self, *args, **kwargs):
        printVersions()
        exit(0)


def main():
    #print('Loading model from file {}'.format('/home/petri/kur/model/output_graph.pbl'), file=sys.stderr)
    model_load_start = timer()
    #ds = Model(args.model, N_FEATURES, N_CONTEXT, args.alphabet, BEAM_WIDTH)
    ds = Model('ac_models/output_graph.pb', N_FEATURES, N_CONTEXT, 'alphabet/alphabet.txt', BEAM_WIDTH)
    model_load_end = timer() - model_load_start
    print('Loaded model in {:.3}s.'.format(model_load_end), file=sys.stderr)

    #print('Loading language model from files {} {}'.format('m_zero_and_one_stuff_bigram.bin', '/home/petri/DeepSpeech/tier/m_only_one_and_zero.tier'), file=sys.stderr)
    lm_load_start = timer()
    ds.enableDecoderWithLM('alphabet/alphabet.txt', 'LM_models/m_zero_and_one_stuff_bigram.bin', 'tier/TRIE_2905', LM_ALPHA, LM_BETA)
    lm_load_end = timer() - lm_load_start
    print('Loaded language model in {:.3}s.'.format(lm_load_end), file=sys.stderr)

    fin = wave.open('mchunk-28.wav', 'rb')
    fs = fin.getframerate()  # 8000 for this file; no resampling is done
    audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)
    audio_length = fin.getnframes() * (1/8000)
    fin.close()

    print('Running inference.', file=sys.stderr)
    inference_start = timer()
    print(ds.stt(audio, fs))
    inference_end = timer() - inference_start
    print('Inference took %0.3fs for %0.3fs audio file.' % (inference_end, audio_length), file=sys.stderr)


if __name__ == '__main__':
    main()

So that’s what gives those different predictions. evaluate.py gives me the results I want. Something is different, and I just don’t see it … Everything is 8000 Hz.

Isn’t that the source code for the deepspeech binary, which has a 16000 Hz default rate, the one kdavis mentioned not to use for that reason? But does that also affect client.py inference? Is that Python client.py still calling the part of the code you copy-pasted?

This pulls in libdeepspeech.so, which has the aforementioned code. With 0.5 we have more flexibility, but before that, changing the rate would require changing the code and rebuilding.


So, that could be the reason why my client.py is giving poor predictions. And because evaluate.py uses the last checkpoint, and not the exported model, that DEFAULT_SAMPLE_RATE=16000 doesn’t affect it at that point … So, what is your advice at this point?

Either go down the road of rebuilding, or just re-train on 0.5.0 and use pre-built binaries. It should work out of the box, and if it does not, it gives us useful and actionable feedback to fix it and improve.

Ok, so either use this version, 0.5.0-alpha.10, or rebuild …

No, you don’t have to rebuild, just pass the value 8000 to the sample_rate parameter in the API. The problem with using the deepspeech binary is that it resamples to 16000, which you don’t want, since your model was trained on 8kHz data.
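
For reference, a minimal sketch of that end to end, based on the client.py posted above (assuming the 0.4.1/0.5.0-style Python API, where stt() takes the sample rate as its second argument; model and file paths are the ones from the thread):

```python
# Sketch: 8 kHz inference with no resampling step anywhere.
import wave
import numpy as np
from deepspeech import Model

ds = Model('ac_models/output_graph.pb', 26, 9, 'alphabet/alphabet.txt', 1024)
ds.enableDecoderWithLM('alphabet/alphabet.txt',
                       'LM_models/m_zero_and_one_stuff_bigram.bin',
                       'tier/TRIE_2905', 0.75, 1.85)

with wave.open('mchunk-28.wav', 'rb') as fin:
    fs = fin.getframerate()
    audio = np.frombuffer(fin.readframes(fin.getnframes()), np.int16)

assert fs == 8000, 'expected an 8 kHz file, got {} Hz'.format(fs)
print(ds.stt(audio, fs))  # sample rate passed explicitly; no conversion
```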

Okay, it would be nice not to have to rebuild it again. In Python’s client.py, do I pass that parameter 8000 like print(ds.stt(audio, 8000)), or where? If that is the right place, it didn’t help …