Audio_sample_rate expectations

Hello,

There is a new flag, audio_sample_rate. Does this imply the decoder will work on other sample rates?

It’s mostly to avoid hardcoding values and instead store them in the model itself, so that people experimenting with a different setup than ours don’t have to rebuild everything, for example.

@lissyx Thanks, I appreciate that. I’m experimenting with 8kHz files and previously up-sampled them in SoX.

Just FYI, this might get you poor results without proper tuning :slight_smile:

Definitely. I’m thinking these parts need the most tuning. I had a WER of ~14 previously:

# Number of MFCC features
c.n_input = 26 # TODO: Determine this programmatically from the sample rate

# The number of frames in the context
c.n_context = 9 # TODO: Determine the optimal value using a validation data set

No, I don’t think you need to change that. Upsampling introduces artifacts, which degrade results. You might want to apply post-processing to limit those.
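
For example (just a sketch using scipy, nothing DeepSpeech-specific; the function name, cutoff, and filter order are mine): since 8kHz audio has no real content above 4kHz, you could low-pass the upsampled signal around the original Nyquist frequency to suppress resampling artifacts.

from scipy.signal import butter, filtfilt

def lowpass_upsampled(audio, fs=16000, cutoff=4000.0, order=5):
    # Upsampled 8kHz speech carries no real content above 4kHz, so a low-pass
    # around the original Nyquist frequency suppresses imaging artifacts.
    b, a = butter(order, cutoff / (fs / 2.0), btype='low')
    return filtfilt(b, a, audio)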

Thanks again, @lissyx. Can you recommend any reading or documentation you’ve come across that might help me better understand how to tune an 8kHz model? My current research path is:

- https://en.wikipedia.org/wiki/Mel-frequency_cepstrum
- the tf.contrib.audio documentation

My plan for reducing artifacts/noise is to raise the log-mel amplitudes to the 2nd or 3rd power (as noted in the Wikipedia article above), but I’m not sure how to bolt that onto the tf.audio functions yet.
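
Roughly what I have in mind (just a sketch; the helper name, exponent, and epsilon are placeholders I made up):

import tensorflow as tf

def powered_log_mel(mel_spectrograms, power=2.0, eps=1e-6):
    # As described in the MFCC article: raise the log-mel amplitudes to a
    # power (around 2 or 3) before the DCT to reduce the influence of
    # low-energy components such as upsampling artifacts.
    log_mel = tf.log(mel_spectrograms + eps)
    return tf.pow(log_mel, power)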

Take a look at tf.signal.mfccs_from_log_mel_spectrograms and its documentation, which includes an example you can tweak.

Cool, thanks. So I’m going to give this a try within feeding.py. If I achieve a WER of ~9 on my data, I’ll invite you all to a party. :slight_smile:

import tensorflow as tf
from tensorflow.contrib.framework.python.ops import audio_ops as contrib_audio

# FLAGS and Config are provided by DeepSpeech (e.g. util.flags / util.config).


def audiofile_to_features(wav_filename):
    # Read the WAV file and decode it to mono float samples.
    samples = tf.read_file(wav_filename)
    decoded = contrib_audio.decode_wav(samples, desired_channels=1)
    features, features_len = samples_to_mfccs(decoded.audio, decoded.sample_rate)

    return features, features_len


def samples_to_mfccs(samples, sample_rate):
    # Window size and stride are given in samples (sample rate * ms / 1000).
    spectrogram = contrib_audio.audio_spectrogram(samples,
                                                  window_size=FLAGS.audio_sample_rate * (Config.n_input / 1000),
                                                  stride=FLAGS.audio_sample_rate * (Config.n_context / 1000),
                                                  magnitude_squared=True)

    num_spectrogram_bins = spectrogram.shape[-1].value
    # The upper edge is the Nyquist frequency (4kHz for 8kHz audio).
    lower_edge_hertz, upper_edge_hertz, num_mel_bins = 80.0, FLAGS.audio_sample_rate / 2, 80

    linear_to_mel_weight_matrix = tf.contrib.signal.linear_to_mel_weight_matrix(
        num_mel_bins, num_spectrogram_bins, FLAGS.audio_sample_rate, lower_edge_hertz,
        upper_edge_hertz)

    # Warp the linear-scale spectrogram onto the mel scale.
    mel_spectrograms = tf.tensordot(
        spectrogram, linear_to_mel_weight_matrix, 1)
    mel_spectrograms.set_shape(spectrogram.shape[:-1].concatenate(
        linear_to_mel_weight_matrix.shape[-1:]))

    log_mel_spectrograms = tf.log(mel_spectrograms + 1e-6)

    # Keep only the first n_input MFCC coefficients.
    mfccs = tf.contrib.signal.mfccs_from_log_mel_spectrograms(
        log_mel_spectrograms)[..., :Config.n_input]

    mfccs = tf.reshape(mfccs, [-1, Config.n_input])

    return mfccs, tf.shape(mfccs)[0]

8kHz audio appears messy because the input is probably phone-call data, which varies in quality and is affected by the handset, the phone provider, and the network quality.

Hello,

I have trained a model using the preprocessing above, and everything appears to work through training, testing, and the WER report output.

However, when I feed a single WAV file to my new model for testing, I get an error. It appears related to my version of TensorFlow (1.12), but I’m not familiar with the issue. It looks like the AudioSpectrogram op isn’t registered in my version before the graph is loaded. From a brief search, it looks like I could use tf.load_op_library, but I’ve never done so and don’t know what else it might break.

Does this read like a TF version issue? If so, do you think it’s still possible to register these contrib packages before the graph is loaded, using tf.load_op_library?

input:

python3 ./native_client/python/client.py \
    --model /Model/output_graph.pbmm \
    --alphabet /Model/alphabet.txt \
    --lm /Model/lm.binary \
    --trie /Model/trie \
    --audio /audio/TestSamp-2.wav

output:

Loading model from file /Model/output_graph.pbmm
TensorFlow: v1.12.0-10-ge232881
DeepSpeech: v0.4.1-0-g0e40db6

$DeviceInfo prints here...

Not found: Op type not registered 'AudioSpectrogram' in binary running 
on 99d855338be4. Make sure the Op and Kernel are registered in the binary 
running in this process. Note that if you are loading a saved graph which 
used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should 
be done before importing the graph, as contrib ops are lazily registered 
when the module is first accessed.

Traceback (most recent call last):
  File "./native_client/python/client.py", line 109, in <module>
    main()
  File "./native_client/python/client.py", line 80, in main
    ds = Model(args.model, N_FEATURES, N_CONTEXT, args.alphabet, BEAM_WIDTH)
  File "/home/me/.local/lib/python3.5/site-packages/deepspeech/__init__.py", line 14, in __init__
    raise RuntimeError("CreateModel failed with error code {}".format(status))
RuntimeError: CreateModel failed with error code 5
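
In a plain Python TensorFlow session, I think the lazy-registration trick the error message suggests would look roughly like this (assuming a regular output_graph.pb protobuf rather than the memory-mapped .pbmm), but I’m not sure it applies to the native client:

import tensorflow as tf

# Importing the contrib audio module first registers its ops (AudioSpectrogram,
# Mfcc) with the Python runtime before the frozen graph is imported.
from tensorflow.contrib.framework.python.ops import audio_ops as contrib_audio

graph_def = tf.GraphDef()
with tf.gfile.GFile('/Model/output_graph.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())
tf.import_graph_def(graph_def, name='')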

You’re trying to use the master code with a TF v1.12 / DeepSpeech v0.4.1 binary. Our supported configuration on master now uses TF 1.13. If you can’t upgrade to TF 1.13, you might still be able to build the native client with TF 1.12, but we haven’t tested that, so you may run into problems. In any case, you need to update your native client build.

Oh, thanks. I would have expected a failure like that to happen much earlier. There were predictions during the training/testing process. So, there must be some way to duct tape this together to work with what I have installed/compiled. I’ll see if I can figure something out.

I don’t think so. We selectively enable only the ops/kernels we use in the model in our libdeepspeech.so build, so the AudioSpectrogram and Mfcc kernels simply don’t exist in your binary.

If these don’t exist, how did the training process even complete? What version of the binary would you recommend? Are we on to 0.5.x now? The GPU build still pulls 0.4.1 as of this morning.

I really appreciate that you all work so hard to answer user questions. I’m trying to read through the documentation but this is a special case that would be solved if I could update my driver/TF version.

I assume training is using the upstream package from PyPI, which has all ops/kernels. The problem here is the client.
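
For example, a quick check with the stock pip package (just a sketch; the window and stride values are arbitrary) runs the op without trouble, whereas libdeepspeech.so simply has no kernel for it:

import tensorflow as tf
from tensorflow.contrib.framework.python.ops import audio_ops as contrib_audio

# The pip TensorFlow package ships the AudioSpectrogram kernel, which is why
# training and the WER report ran fine even though the native client lacks it.
with tf.Session() as sess:
    samples = tf.zeros([16000, 1], dtype=tf.float32)  # one second of silence
    spectrogram = contrib_audio.audio_spectrogram(samples, window_size=512,
                                                  stride=320, magnitude_squared=True)
    print(sess.run(spectrogram).shape)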