Audio_sample_rate expectations

Hello,

There is a new flag, audio_sample_rate. Does this imply the decoder will work on other sample rates?

It’s mostly to avoid hardcoding values and instead store them in the model itself, so that people experimenting with a different setup than ours don’t have to rebuild everything, for example.

@lissyx Thanks, I appreciate that. I’m experimenting with 8kHz files and previously up-sampled them in SoX.

Just FYI, this might get you poor results without proper tuning :slight_smile:

Definitely. I’m thinking these parts need the most tuning. I had a WER of ~14 previously:

# Number of MFCC features
c.n_input = 26 # TODO: Determine this programmatically from the sample rate

# The number of frames in the context
c.n_context = 9 # TODO: Determine the optimal value using a validation data set

No, I don’t think you need to change that. Upsampling introduces artifacts, which degrade results. You might want to apply post-processing to limit those.
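
For example (just a sketch using scipy, nothing DeepSpeech-specific; the function name, cutoff, and filter order are mine): since 8kHz audio has no real content above 4kHz, you could low-pass the upsampled signal around the original Nyquist frequency to suppress resampling artifacts.

from scipy.signal import butter, filtfilt

def lowpass_upsampled(audio, fs=16000, cutoff=4000.0, order=5):
    # Upsampled 8kHz speech carries no real content above 4kHz, so a low-pass
    # around the original Nyquist frequency suppresses imaging artifacts.
    b, a = butter(order, cutoff / (fs / 2.0), btype='low')
    return filtfilt(b, a, audio)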

Thanks again, @lissyx. Can you recommend any reading or documentation you’ve come across that might help me better understand how to tune an 8kHz model? My current research path is:

- https://en.wikipedia.org/wiki/Mel-frequency_cepstrum
- the tf.contrib.audio documentation

My plan for reducing artifacts/noise is to raise the log-mel amplitudes to the 2nd or 3rd power (as noted in the Wikipedia article above), but I’m not sure how to bolt that onto the tf.audio functions yet.
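
Roughly what I have in mind (just a sketch; the helper name, exponent, and epsilon are placeholders I made up):

import tensorflow as tf

def powered_log_mel(mel_spectrograms, power=2.0, eps=1e-6):
    # As described in the MFCC article: raise the log-mel amplitudes to a
    # power (around 2 or 3) before the DCT to reduce the influence of
    # low-energy components such as upsampling artifacts.
    log_mel = tf.log(mel_spectrograms + eps)
    return tf.pow(log_mel, power)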

Take a look at tf.signal.mfccs_from_log_mel_spectrograms and its documentation, which includes an example you can tweak.

Cool, thanks. So I’m going to give this a try within feeding.py. If I achieve a WER of ~9 on my data, I’ll invite you all to a party. :slight_smile:

import tensorflow as tf
from tensorflow.contrib.framework.python.ops import audio_ops as contrib_audio

# FLAGS and Config are provided by DeepSpeech (e.g. util.flags / util.config).


def audiofile_to_features(wav_filename):
    # Read the WAV file and decode it to mono float samples.
    samples = tf.read_file(wav_filename)
    decoded = contrib_audio.decode_wav(samples, desired_channels=1)
    features, features_len = samples_to_mfccs(decoded.audio, decoded.sample_rate)

    return features, features_len


def samples_to_mfccs(samples, sample_rate):
    # Window size and stride are given in samples (sample rate * ms / 1000).
    spectrogram = contrib_audio.audio_spectrogram(samples,
                                                  window_size=FLAGS.audio_sample_rate * (Config.n_input / 1000),
                                                  stride=FLAGS.audio_sample_rate * (Config.n_context / 1000),
                                                  magnitude_squared=True)

    num_spectrogram_bins = spectrogram.shape[-1].value
    # The upper edge is the Nyquist frequency (4kHz for 8kHz audio).
    lower_edge_hertz, upper_edge_hertz, num_mel_bins = 80.0, FLAGS.audio_sample_rate / 2, 80

    linear_to_mel_weight_matrix = tf.contrib.signal.linear_to_mel_weight_matrix(
        num_mel_bins, num_spectrogram_bins, FLAGS.audio_sample_rate, lower_edge_hertz,
        upper_edge_hertz)

    # Warp the linear-scale spectrogram onto the mel scale.
    mel_spectrograms = tf.tensordot(
        spectrogram, linear_to_mel_weight_matrix, 1)
    mel_spectrograms.set_shape(spectrogram.shape[:-1].concatenate(
        linear_to_mel_weight_matrix.shape[-1:]))

    log_mel_spectrograms = tf.log(mel_spectrograms + 1e-6)

    # Keep only the first n_input MFCC coefficients.
    mfccs = tf.contrib.signal.mfccs_from_log_mel_spectrograms(
        log_mel_spectrograms)[..., :Config.n_input]

    mfccs = tf.reshape(mfccs, [-1, Config.n_input])

    return mfccs, tf.shape(mfccs)[0]

8kHz audio appears messy because the input is probably phone-call data, which varies in quality and is affected by the handset, the phone provider, and the network quality.

Hello,

I have trained a model using the preprocessing above, and everything appears to work through training, testing, and the WER report output.

However, when I feed a single WAV file to my new model for testing, I get an error. It appears related to my version of TensorFlow (1.12), but I’m not familiar with the issue. It looks like the AudioSpectrogram op isn’t registered in my version before the graph is loaded. From a brief search, it looks like I could use tf.load_op_library, but I’ve never done so and don’t know what else it might break.

Does this read like a TF version issue? If so, do you think it’s still possible to register these contrib packages before the graph is loaded, using tf.load_op_library?

input:

python3 ./native_client/python/client.py \
    --model /Model/output_graph.pbmm \
    --alphabet /Model/alphabet.txt \
    --lm /Model/lm.binary \
    --trie /Model/trie \
    --audio /audio/TestSamp-2.wav

output:

Loading model from file /Model/output_graph.pbmm
TensorFlow: v1.12.0-10-ge232881
DeepSpeech: v0.4.1-0-g0e40db6

$DeviceInfo prints here...

Not found: Op type not registered 'AudioSpectrogram' in binary running 
on 99d855338be4. Make sure the Op and Kernel are registered in the binary 
running in this process. Note that if you are loading a saved graph which 
used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should 
be done before importing the graph, as contrib ops are lazily registered 
when the module is first accessed.

Traceback (most recent call last):
  File "./native_client/python/client.py", line 109, in <module>
    main()
  File "./native_client/python/client.py", line 80, in main
    ds = Model(args.model, N_FEATURES, N_CONTEXT, args.alphabet, BEAM_WIDTH)
  File "/home/me/.local/lib/python3.5/site-packages/deepspeech/__init__.py", line 14, in __init__
    raise RuntimeError("CreateModel failed with error code {}".format(status))
RuntimeError: CreateModel failed with error code 5
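
In a plain Python TensorFlow session, I think the lazy-registration trick the error message suggests would look roughly like this (assuming a regular output_graph.pb protobuf rather than the memory-mapped .pbmm), but I’m not sure it applies to the native client:

import tensorflow as tf

# Importing the contrib audio module first registers its ops (AudioSpectrogram,
# Mfcc) with the Python runtime before the frozen graph is imported.
from tensorflow.contrib.framework.python.ops import audio_ops as contrib_audio

graph_def = tf.GraphDef()
with tf.gfile.GFile('/Model/output_graph.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())
tf.import_graph_def(graph_def, name='')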

You’re trying to use the master code with a TF v1.12 / DeepSpeech v0.4.1 binary. Our supported configuration on master now uses TF 1.13. If you can’t upgrade to TF 1.13, you might still be able to build the native client with TF 1.12, but we haven’t tested that, so you may run into problems. In any case, you need to update your native client build.

Oh, thanks. I would have expected a failure like that to happen much earlier. There were predictions during the training/testing process. So, there must be some way to duct tape this together to work with what I have installed/compiled. I’ll see if I can figure something out.

I don’t think so. We selectively enable only the ops/kernels we use in the model in our libdeepspeech.so build, so the AudioSpectrogram and Mfcc kernels simply don’t exist in your binary.

If these don’t exist, how did the training process even complete? What version of the binary would you recommend? Are we on to 0.5.x now? The GPU build still pulls 0.4.1 as of this morning.

I really appreciate that you all work so hard to answer user questions. I’m trying to read through the documentation but this is a special case that would be solved if I could update my driver/TF version.

I assume training is using the upstream package from PyPI, which has all ops/kernels. The problem here is the client.
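
For example, a quick check with the stock pip package (just a sketch; the window and stride values are arbitrary) runs the op without trouble, whereas libdeepspeech.so simply has no kernel for it:

import tensorflow as tf
from tensorflow.contrib.framework.python.ops import audio_ops as contrib_audio

# The pip TensorFlow package ships the AudioSpectrogram kernel, which is why
# training and the WER report ran fine even though the native client lacks it.
with tf.Session() as sess:
    samples = tf.zeros([16000, 1], dtype=tf.float32)  # one second of silence
    spectrogram = contrib_audio.audio_spectrogram(samples, window_size=512,
                                                  stride=320, magnitude_squared=True)
    print(sess.run(spectrogram).shape)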