The DeepSpeech-Polyglot project has received a large update over the last few weeks. It was reimplemented in TensorFlow 2, new networks were added, and recognition performance has improved greatly. It also got a new name, Scribosermo, and can now be found here:
The new models train quickly (~3 days on 2x1080Ti to reach SOTA in German) and need comparatively small datasets (~280h for competitive results in Spanish). With a little more time and data, the following word error rates (WER) were achieved on the CommonVoice test set:
German | English | Spanish | French |
---|---|---|---|
7.2 % | 3.7 % | 10.0 % | 11.7 % |
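
For readers unfamiliar with the metric, here is a minimal sketch of how a word error rate is usually computed: the word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. The function name `word_error_rate` is made up for illustration, and this is not necessarily the exact evaluation code or text normalization that Scribosermo uses.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative helper only; real evaluations usually normalize the text
    first and sum errors over the whole test set before dividing.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One of six reference words is missing in the hypothesis.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A WER of 7.2 % therefore means roughly 7 word errors per 100 reference words.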
Training custom models with Scribosermo is simple: step-by-step instructions can be found in the readmes. Adding new languages is easy, too. After training, the models can be exported to tflite format for easier inference. They run faster than real time on a Raspberry Pi 4.
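If you want to check the "faster than real time" claim on your own hardware, a common measure is the real-time factor: processing time divided by audio duration, where a value below 1.0 means faster than real time. The sketch below is only an illustration. `real_time_factor` and the `transcribe` argument are hypothetical names, and `transcribe` stands in for whatever inference function you use, not for an actual Scribosermo API.

```python
import time
from typing import Callable


def real_time_factor(
    transcribe: Callable[[bytes], str],
    audio: bytes,
    audio_seconds: float,
) -> float:
    """Return processing time divided by audio duration.

    `transcribe` is a placeholder for your own inference call.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    start = time.perf_counter()
    transcribe(audio)
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds


if __name__ == "__main__":
    # Dummy stand-in: one second of 16 kHz, 16-bit silence and a no-op model.
    dummy_audio = b"\x00" * 32000
    rtf = real_time_factor(lambda _: "", dummy_audio, audio_seconds=1.0)
    print(f"real-time factor: {rtf:.4f}")
```

For a meaningful number, average over several files and discard the first run, since model loading and warm-up usually inflate it.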
The most important features are already implemented, but there is still room for optimization. Feel free to improve it and send a merge request. It would also be great if you could publish your own models.
Note: Currently only inference with Python is supported; the new models are no longer compatible with the DeepSpeech bindings (the old models are still available). Technically it should be possible to integrate them again. If someone is interested in doing this, some notes can be found in this thread: Integration of DeepSpeech-Polyglot's new networks