As I already mentioned some time ago, I'm working on implementing new STT networks using TensorFlow 2.
Over the last few days I made a lot of progress and want to share it, along with a request to you and a suggestion for the future development process.
My current network, an implementation of QuartzNet 15x5 (paper), reaches a WER of 3.7% on LibriSpeech test-clean using your official English scorer.
The network also has far fewer parameters (19M vs. 48M) and thus should be faster at inference than the current DeepSpeech 1 network. If required, there is also the option to use the even smaller QuartzNet 5x5 model (6.7M parameters), with which I reached a WER of 4.5%.
You can find the pretrained networks and the source code here:
(Please note that the training code is still highly experimental and many features DeepSpeech has are still missing, but I hope to add most of them soon.)
Now I would like to ask whether we could make these models usable with the DeepSpeech bindings. The problem is that this will require some changes in the native client code, because apart from the network architecture I also had to change the input pipeline.
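To give an idea of what the changed input pipeline looks like: DeepSpeech's current pipeline is built around MFCC features, while QuartzNet-style networks take log-mel filterbanks. The following is only a rough sketch with tf.signal, using typical QuartzNet values (20 ms window, 10 ms stride, 64 mel bins); the exact code in my repository may differ in details such as normalization:

```python
import tensorflow as tf

def log_mel_features(audio, sample_rate=16000,
                     window_ms=20, stride_ms=10, n_mels=64):
    """audio: float32 tensor of shape [num_samples], scaled to [-1, 1]."""
    frame_length = int(sample_rate * window_ms / 1000)   # 320 samples
    frame_step = int(sample_rate * stride_ms / 1000)     # 160 samples

    # Magnitude spectrogram via short-time Fourier transform.
    stft = tf.signal.stft(audio, frame_length, frame_step, fft_length=512)
    power = tf.abs(stft) ** 2

    # Project onto the mel scale and apply log compression.
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=n_mels,
        num_spectrogram_bins=power.shape[-1],
        sample_rate=sample_rate,
        lower_edge_hertz=0.0,
        upper_edge_hertz=sample_rate / 2)
    return tf.math.log(tf.matmul(power, mel_matrix) + 1e-6)
```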
It would be great if you could look into this and update the client bindings accordingly. We would also need to think about a new procedure for streaming inference, but some parts of Nvidia's reference implementation (link) should be usable for that.
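Regarding streaming, the direction I have in mind (roughly what I understand Nvidia's buffered approach to be, so please treat the details as my assumption) is to run the fully convolutional network over overlapping audio windows and keep only the frames that have enough left and right context. A minimal sketch, reusing the log_mel_features helper from the sketch above and assuming a `model` callable that maps features to per-frame logits:

```python
import numpy as np
import tensorflow as tf

def buffered_logits(audio, model, sample_rate=16000,
                    window_s=4.0, context_s=1.0, stride_ms=10):
    """Run `model` over overlapping windows and stitch the logits together."""
    window = int(window_s * sample_rate)    # audio the model sees per step
    context = int(context_s * sample_rate)  # overlap kept on each side
    step = window - 2 * context             # fresh audio per step
    # Frames to trim per side; adjust if the model reduces the time axis
    # (QuartzNet's first block strides by 2, halving the frame rate).
    trim = int(context_s * 1000 / stride_ms)

    pieces = []
    for start in range(0, max(len(audio) - 2 * context, 1), step):
        chunk = tf.convert_to_tensor(audio[start:start + window], tf.float32)
        logits = model(log_mel_features(chunk)[tf.newaxis])[0].numpy()
        left = 0 if start == 0 else trim                        # keep the very start
        right = None if start + window >= len(audio) else -trim  # and the very end
        pieces.append(logits[left:right])
    return np.concatenate(pieces)  # decode with CTC / the scorer afterwards
```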
Besides the request to update the bindings, I would like to make another suggestion, just as an idea to think about, which I believe could improve development in the future: splitting DeepSpeech into three parts: usage, training and datasets.
We would keep the GitHub repo as the main repository and entry point, but split out the training part into DeepSpeech-Polyglot. This should save you a lot of time compared to updating DeepSpeech itself, and hopefully gives me some more development support. I would also give you access to that repository.
Splitting the downloading and preparation of the datasets into its own tool would make it usable for other STT projects too, so new datasets might be added faster. I would suggest using corcua for that, which I created with the focus of making the addition of new datasets as easy as possible (I first tried audiomate, but found their architecture too complicated).
What do you think about the two ideas?
Greetings
Daniel
(Notes on the above checkpoints: I transferred the pretrained models from Nvidia, who used PyTorch, to TensorFlow. While this works well for the network itself, I had some problems with the input pipeline. The spectrogram + filterbank calculation has a slightly different output in TensorFlow, which increases the WER of the transferred networks by about 1%. The problem could be reduced somewhat by training a few additional epochs on LibriSpeech, but I think we could still improve the result by about 0.6% if we either resolve the pipeline difference or run a longer training over the transferred checkpoint.)
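If someone wants to look into that pipeline difference: a quick diagnostic is to compute the same log-mel features once with tf.signal and once with librosa (librosa is only my stand-in here; the actual PyTorch preprocessing may differ further) and compare the outputs. Details like windowing and mel-filter normalization are exactly where the two diverge:

```python
import numpy as np
import librosa
import tensorflow as tf

# Any 16 kHz clip works, e.g. a file from LibriSpeech test-clean.
audio, sr = librosa.load("sample.wav", sr=16000)

# librosa version: power mel spectrogram, 20 ms window / 10 ms hop, 64 bins.
# center=False so the frames line up with the tf.signal framing below.
mel_ref = librosa.feature.melspectrogram(
    y=audio, sr=sr, n_fft=320, hop_length=160, n_mels=64,
    power=2.0, center=False)
log_ref = np.log(mel_ref.T + 1e-6)                       # [frames, 64]

# tf.signal version with the same framing.
stft = tf.signal.stft(audio.astype(np.float32), 320, 160, fft_length=320)
mel_matrix = tf.signal.linear_to_mel_weight_matrix(64, 161, sr, 0.0, sr / 2)
log_tf = tf.math.log(tf.matmul(tf.abs(stft) ** 2, mel_matrix) + 1e-6).numpy()

# The remaining gap comes mostly from how the mel filters are built and
# normalized (librosa/Slaney vs. tf.signal), plus any dithering or
# preemphasis the original PyTorch pipeline applies on top.
n = min(len(log_ref), len(log_tf))
print("mean abs difference:", np.abs(log_ref[:n] - log_tf[:n]).mean())
```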