Using DeepSpeech encoder only?

We would like to use DeepSpeech for a classification problem: given an audio command, we want to tell which known command it most closely matches. To this end, we want to use only the pretrained encoder here in DeepSpeech and compare the vector-encoded outputs of various audio files to determine which are closest.

  1. Is this the right way to use this library to solve our problem?
  2. Is there a clean way in code to run an audio clip through only the encoder? We have been digging through the source and it appears to be nontrivial. If there isn't, we would like to request this as a feature, since other people will likely run into the same need (see the sketch after this list for what we have in mind).
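For concreteness, here is a rough sketch of our intended usage. The `encode()` wrapper is hypothetical: it stands in for the missing piece we are asking about, i.e. running a clip through only the pretrained acoustic-model layers and getting back per-time-step features instead of decoded text.

```python
import numpy as np

def encode(wav_path):
    """Hypothetical wrapper: run `wav_path` through only the pretrained
    encoder layers (no CTC decoding) and return an array of shape
    (time_steps, feature_dim). This is the piece we could not find a
    clean way to build from the existing source."""
    raise NotImplementedError

def clip_embedding(wav_path):
    # Mean-pool the per-time-step features into one fixed-size vector,
    # so clips of different lengths become comparable.
    return encode(wav_path).mean(axis=0)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Classify a command by its nearest reference clip.
refs = {"play": clip_embedding("play.wav"), "stop": clip_embedding("stop.wav")}
query = clip_embedding("command.wav")
best = max(refs, key=lambda name: cosine_similarity(refs[name], query))
print("closest command:", best)
```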

Can you clarify which part appears to be nontrivial? You're doing something quite different from what DeepSpeech does, so a not-insignificant amount of work is expected, but it shouldn't be too bad. Instead of computing the CTC loss as we currently do, plug in your custom targets and loss and train the new model.
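To make that concrete, here is a minimal sketch of swapping the CTC head for a classification target. It is not DeepSpeech's actual training code; `features` is a stand-in for the encoder's per-time-step output, and `NUM_COMMANDS` / `FEATURE_DIM` are assumed values for your setup.

```python
import tensorflow as tf

NUM_COMMANDS = 10   # assumption: ten distinct voice commands
FEATURE_DIM = 2048  # assumption: width of the encoder's per-step output

# Stand-in for the pretrained encoder output: a batch of
# variable-length (time_steps, FEATURE_DIM) feature sequences.
features = tf.keras.Input(shape=(None, FEATURE_DIM))

# Replace the CTC head: pool over time, then predict a command class.
pooled = tf.keras.layers.GlobalAveragePooling1D()(features)
logits = tf.keras.layers.Dense(NUM_COMMANDS)(pooled)

model = tf.keras.Model(features, logits)
model.compile(
    optimizer="adam",
    # Custom targets + loss in place of CTC: integer command labels
    # trained with ordinary cross-entropy.
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
```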

Thank you for your response!! We were hoping to use DeepSpeech almost as a pretrained speech "embedder" into some vector space, to solve the problem of "given two arbitrary voice clips, how different are they?", where "difference" is measured over the words actually spoken. Is this possible? Let me know if anything is still confusing.
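Since "difference over the words actually spoken" depends on order, mean pooling (as in our first sketch) may throw away too much. One option we are considering is comparing the full per-time-step sequences directly, e.g. with dynamic time warping over the hypothetical `encode()` output from above:

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic-time-warping distance between two (time, dim) feature
    sequences, with Euclidean frame cost. O(len(seq_a) * len(seq_b))."""
    n, m = len(seq_a), len(seq_b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # skip a frame in seq_a
                                 cost[i, j - 1],      # skip a frame in seq_b
                                 cost[i - 1, j - 1])  # match the two frames
    return cost[n, m]

# "How different are these two clips?" as a single number:
# distance = dtw_distance(encode("clip1.wav"), encode("clip2.wav"))
```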