Integration of DeepSpeech-Polyglot's new networks

Hi @reuben and @lissyx,

as I already mentioned some time ago, I'm working on implementing new STT networks using TensorFlow 2.
Over the last few days I've made a lot of progress and wanted to share it, along with a request to you and a suggestion for the future development process.

My current network, which implements Quartznet15x5 (paper), reaches a WER of 3.7% on LibriSpeech (test-clean) using your official English scorer.
The network also has far fewer parameters (19M vs 48M) and should therefore be faster at inference than the current DeepSpeech1 network. There is also the option to use the even smaller Quartznet5x5 model (6.7M params) if required; I reached a WER of 4.5% with it.

You can find the pretrained networks and the source code here:
(Please note that the training code is still highly experimental and a lot of features DeepSpeech has are still missing, but I hope to add most of them soon.)

Now I would like to ask whether we could make those models usable with the deepspeech bindings. The problem is that this will require some changes in the native client code, because apart from the network architecture I also had to change the input pipeline.
It would be great if you could look into this and update the client bindings accordingly. We would also need to think about a new procedure for streaming inference, but some parts of the reference implementation from Nvidia (link) should be usable for that.

Besides the request for updating the bindings, I would like to make another suggestion, just as an idea to think about, which I think could improve development in the future: splitting DeepSpeech into the three parts of usage, training and datasets.
We keep the GitHub repo as the main repository and entry point, but split out the training part into DeepSpeech-Polyglot. This should save you a lot of time compared to updating DeepSpeech and hopefully gives me some more development support. I would also give you access to the repository then.
Splitting the downloading and preparation of datasets into its own tool would make it usable for other STT projects too, and therefore new datasets might be added faster. I would suggest using corcua for that, which I created with the focus of making the addition of new datasets as easy as possible (I first tried audiomate, but I found their architecture too complicated).

What do you think about the two ideas?

Greetings
Daniel



(Notes on the above checkpoint: I transferred the pretrained models from Nvidia, who used PyTorch, to TensorFlow. While this works well for the network itself, I had some problems with the input pipeline. The spectrogram+filterbank calculation has a slightly different output in TensorFlow, which increases the WER of the transferred networks by about 1%. The problem could be reduced somewhat by training some additional epochs on LibriSpeech, but I think we could still improve this by about 0.6% if we either solve the pipeline difference or run a longer training over the transferred checkpoint.)


Fantastic work!

As you may have noticed from the posts here over the past few months, the maintainers currently have limited time to work on the project. Realistically, neither I nor @lissyx will have time to make such a change, let alone understand your codebase first. If you’re invested in updating the native client bindings and can write up a plan in an issue, with the relevant differences highlighted and how they could be solved, that’d be a better start to this conversation. If we can find a workable integration path I’m happy to mentor and review code, but I won’t have bandwidth to work on this directly.

Note that due to the complete lack of any licensing terms on NVIDIA’s pre-trained checkpoints, we wouldn’t be able to ever release models transferred from them as part of official DeepSpeech releases.

I repeat myself, but again, not enough developer time to take on major refactors like this that have implications throughout the entire codebase right now.

Hmm, I will be quite busy for a while with improving the training itself. But maybe some of the other, more experienced developers/users here are interested in taking over some parts… We could also skip the streaming feature at first.

Not an issue yet, but just to mention the most important pipeline updates (a rough sketch of these steps follows after the list):

  • Audio signal: added normalization and pre-emphasis (should be easy to add those)
  • Log filterbanks instead of MFCCs
  • Added per-feature (channel) normalization
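
As a rough illustration (not my exact code, and the frame sizes and mel bin count are placeholders), these steps look something like the following in TensorFlow 2:

```python
import tensorflow as tf

def audio_to_features(signal, sample_rate=16000, preemphasis=0.97,
                      frame_length=400, frame_step=160, fft_length=512,
                      n_mels=64):
    # Sketch of the updated pipeline; parameter values are illustrative only.

    # Signal normalization: scale into [-1, 1] by the loudest sample
    signal = signal / (tf.reduce_max(tf.abs(signal)) + 1e-9)

    # Pre-emphasis filter
    signal = tf.concat([signal[:1], signal[1:] - preemphasis * signal[:-1]], axis=0)

    # STFT -> power spectrogram
    stft = tf.signal.stft(signal, frame_length=frame_length,
                          frame_step=frame_step, fft_length=fft_length)
    spectrogram = tf.abs(stft) ** 2

    # Log mel filterbanks instead of MFCCs
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=n_mels,
        num_spectrogram_bins=fft_length // 2 + 1,
        sample_rate=sample_rate)
    log_mel = tf.math.log(tf.matmul(spectrogram, mel_matrix) + 1e-6)

    # Per-feature (channel) normalization over the time axis
    mean = tf.reduce_mean(log_mel, axis=0, keepdims=True)
    std = tf.math.reduce_std(log_mel, axis=0, keepdims=True)
    return (log_mel - mean) / (std + 1e-9)
```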

Question from me: Can we use TensorFlow functions / the tf data pipeline directly in the native client, or did you reimplement all those parts in native C?


The NeMo project is under Apache 2.0; this might be the case for the checkpoints, too. But I will investigate this further.

I don’t think we want to, realistically.

This is something with a lot of opinions; I’m not sure about that.

Same.

We tend to leverage as much as we can from TensorFlow, that’s why we have some patches to expose some internals to the builds.

This might be opinionated, but Nvidia has already made that decision here.


I will try to package the preprocessing steps into the .pb and .tflite checkpoints, so that only the raw signal is needed as input. Would this be easier to support in the native client then? The model outputs the CTC probabilities; the only small difference to the DeepSpeech1 net is that the output timesteps are half the length of the input timesteps.
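
Roughly, I'm thinking of wrapping it like this (a sketch only; `audio_to_features` refers to the pipeline sketch above, and `acoustic_model` stands in for the Quartznet model):

```python
import tensorflow as tf

class RawAudioModel(tf.Module):
    # Bundles feature extraction and the acoustic model, so the exported
    # graph takes only the raw signal and returns CTC probabilities.

    def __init__(self, acoustic_model):
        super().__init__()
        self.acoustic_model = acoustic_model  # stand-in for the Quartznet model

    @tf.function(input_signature=[tf.TensorSpec([None], tf.float32)])
    def predict(self, signal):
        features = audio_to_features(signal)         # pipeline sketch from above
        features = tf.expand_dims(features, axis=0)  # add batch dimension
        return self.acoustic_model(features)         # CTC probabilities

# Export with the preprocessing baked in:
# module = RawAudioModel(acoustic_model)
# tf.saved_model.save(module, "export/",
#                     signatures={"serving_default": module.predict})
```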

We already apply preemphasis. I don’t know what you mean by “normalization”. Streaming support makes it hard to do normalizations that operate over time.

Yes, you can, that’s already how the native client operates. The inference graph has a few sub-graphs, one of which is for feature computation and takes a single window’s worth of samples and returns the features. You should be able to change that at will as long as the input and output node names and shapes match and it’ll be compatible with existing native client binaries.

Same comment as above about streaming and normalization.

Just a simple normalization to fit the audio into the range -1 to 1 at its loudest point (dividing by the highest absolute sample value).

I can leave out the signal normalization, since it has almost no impact, but dropping the feature normalization increases the WER to 29.7%.


From the comments I’ve found in the deepspeech.cc file it seems I need 3 input nodes. Could you please tell me the input names and shapes, or provide me with a link to the source files?

Did you benchmark the difference in inference speed between using all three buffers and a single buffer for audio? I’m not sure if I can replicate the mfcc and batch buffers with regard to the input shape.




You might have to take a look at headers as well.

This is where the inference graph is created: https://github.com/mozilla/DeepSpeech/blob/f27908e7e3781b4ebed228a27439d9988b13a5c7/training/deepspeech_training/train.py#L687-L774
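
For orientation, the feature sub-graph contract looks roughly like this; the node names and window size below are assumptions from my reading of that file, so verify them against the linked revision:

```python
import tensorflow as tf
import tensorflow.compat.v1 as tfv1

# Illustrative sketch only: node names and the window size are assumptions.
AUDIO_WINDOW_SAMPLES = 512  # one window's worth of raw samples (assumption)

graph = tfv1.Graph()
with graph.as_default():
    # Input node: a single window of raw samples
    input_samples = tfv1.placeholder(
        tf.float32, [AUDIO_WINDOW_SAMPLES], name='input_samples')
    # Replace this with the new pipeline's per-window feature computation;
    # a dummy reshape keeps the sketch self-contained.
    features = tf.reshape(input_samples, [1, AUDIO_WINDOW_SAMPLES])
    # Output node the native client reads the features from
    output = tf.identity(features, name='mfccs')
```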

I have been able to package the input pipeline into the .pb model, but I'm currently struggling with the .tflite conversion. For now this only works with a fixed signal length. I'm not sure if this can be fixed or if the input has to be split into chunks of fixed size.
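
For reference, the conversion I'm trying looks roughly like this (a sketch; `module` stands in for the exported model with the preprocessing baked in, and FIXED_LEN is the fixed signal length the conversion currently requires):

```python
import tensorflow as tf

FIXED_LEN = 16000 * 4  # e.g. 4 s of 16 kHz audio (placeholder value)

# `module` is the exported model with preprocessing included (earlier sketch)
concrete_fn = module.predict.get_concrete_function(
    tf.TensorSpec([FIXED_LEN], tf.float32))

converter = tf.lite.TFLiteConverter.from_concrete_functions([concrete_fn])
# Some signal-processing ops may need the Select TF ops fallback
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS,
]
tflite_model = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_model)
```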

I also took a closer look at Nvidia's streaming approach, which uses fixed values for the per-feature-norm augmentation. In a test run without file-based normalization, this slightly increased the WER from 3.7% to 4.4%, but I think we could work with that. Compared to DeepSpeech, the input buffer size is much larger: they use a frame size of 2 s plus an additional window of 2.5 s before and after the current frame.
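
As I understand it, the chunking would look roughly like this (a sketch following the numbers above, not Nvidia's actual implementation):

```python
import numpy as np

def stream_chunks(signal, sample_rate=16000, frame_s=2.0, context_s=2.5):
    # Each step feeds the model a 2 s frame plus 2.5 s of context on each
    # side; only the transcript for the central frame would be kept.
    frame = int(frame_s * sample_rate)
    context = int(context_s * sample_rate)
    # Pad so the first and last frames also get full context
    padded = np.pad(signal, (context, context))
    for start in range(0, len(signal), frame):
        yield padded[start:start + frame + 2 * context]
```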

Running with tflite-runtime only and with arbitrary input lengths is now supported. The models run in real time on a Raspberry Pi, even a little faster than the current DeepSpeech network. So it should be possible to use the models with the buffered input approach from the native client without major architecture changes. But I currently have no need for this myself, so it would be great if somebody else could implement it.
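
For anyone who wants to try it, inference with tflite-runtime looks roughly like this (the model path is a placeholder; the model is assumed to take the raw signal and return CTC probabilities):

```python
import numpy as np
import tflite_runtime.interpreter as tflite

interpreter = tflite.Interpreter(model_path="model.tflite")
input_detail = interpreter.get_input_details()[0]
output_detail = interpreter.get_output_details()[0]

signal = np.zeros(3 * 16000, dtype=np.float32)  # e.g. 3 s of 16 kHz audio

# Resize the input tensor to the actual signal length, then run inference
interpreter.resize_tensor_input(input_detail["index"], signal.shape)
interpreter.allocate_tensors()
interpreter.set_tensor(input_detail["index"], signal)
interpreter.invoke()
ctc_probs = interpreter.get_tensor(output_detail["index"])
```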

The project was renamed to Scribosermo and moved; it can now be found here:

I also wrote up some more details about the training and recognition performance here: Links to pretrained models


Added an example for streaming with the new networks (link).
Currently it’s using the ctc-decoder’s buffer feature directly without the native client approach.