DeepSpeech Roadmap?

Hi Guys,

First, let me thank you for all the awesome open source software you make at Mozilla. We are really thankful.

I wanted to know what the end goal of DeepSpeech is. For many projects we have found Mozilla to provide one of the best, if not the best, open source solutions. So I was wondering whether that is the goal with DeepSpeech as well.

So far in the open source speech recognition scene there are four projects competing as end-to-end ASR systems, rather than just collections of algorithms and proof-of-concept code. They are:

  1. CMUSphinx - does not currently seem to be in active development
  2. DeepSpeech (Mozilla)
  3. Wav2Letter (Facebook)
  4. Kaldi

My questions to the members of the development team are:

  1. Why did you choose DeepSpeech from Baidu? Why not Kaldi, or Wav2Letter?
  2. How much of the infrastructure (the end-to-end stuff) is directly tied to TensorFlow and Baidu’s algorithms? Is it easy to change to other systems like Kaldi or Wav2Letter? Do you want to make it that way?
  3. Most importantly, how do you view projects 3 and 4?
  4. At what level is the DeepSpeech project within Mozilla? Is it an important project that will be supported long-term, or currently a proof of concept?


You have the source code, so you can easily see how it is.
If you follow a bit, you will see that we diverged from Baidu’s original proposal some time ago.

Define easy. We have an API that abstracts stuff, so if you re-implement everything with another solution, it should work.

Not sure I get your question here.

What kind of feedback are you seeking here?

Have you had a look at how long the project has been worked on now, and how many releases have happened? We shared plans for 1.0 to provide a more long-term stable API; that should give hints.

Hey @lissyx
Thanks for the answers.

I think it is not practical to read through the codebase to understand this. That’s why I asked you. There is no clear architectural overview of the algorithms and systems that power DeepSpeech, nor of the design decisions taken from different perspectives while implementing them. From a pure software engineering perspective, I think the STT algorithm and its dependencies should be separate from the infrastructure. Basically, I am looking for the high-level design decisions taken by the team. I tried to find them in the repo but could not. If you don’t have this documented, I am very much interested in helping on that front.

Fair enough. This is somewhat related to the point above. I am interested in knowing how this abstraction system works. I am happy to help if it is not documented.

Currently Firefox does not support the Web Speech API; you need to enable it in Nightly, where it uses Google’s STT. So I was asking whether this is the project that is poised to do that job, or whether it is a separate project with a different purpose. Basically, what is the end goal for DeepSpeech? The readme says:

DeepSpeech is an open source Speech-To-Text engine, using a model trained by machine learning techniques based on Baidu’s Deep Speech research paper. Project DeepSpeech uses Google’s TensorFlow to make the implementation easier.

That’s it. This is a legitimate question; as I think you already know, Apache has incubator projects as well as top-level projects. Kubernetes, OWASP and many other major organizations also have a way of stating how important a project is to the organization. Maybe this could also be added to the readme. Also, DeepSpeech is quite new compared to Kaldi or CMUSphinx. From the releases I can tell it is in quite a rapid development mode, but not how important it is to Mozilla.

This is quite important to me, as I wanted to know whether you (the team) consider Kaldi or Wav2Letter to be technically or practically better than, or on par with, DeepSpeech. WER, RTF etc. are not important to me right now. The reason I am asking is that research moves quickly in machine learning. So what happens to DeepSpeech if a better method comes along? Generally, in well-thought-out projects, because the systems are abstracted well from the beginning, it is just a matter of implementing the new thing plus some glue logic. I was wondering whether these projects (Kaldi etc.) were considered when first implementing DeepSpeech. What is the pitch for DeepSpeech: why is it better, comparable, similar, etc.?
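The kind of abstraction I have in mind above could be sketched as follows. This is purely illustrative: the interface and class names here are hypothetical, not DeepSpeech’s actual API. The idea is that application code depends only on an engine-agnostic interface, so swapping in a DeepSpeech-, Kaldi- or Wav2Letter-backed implementation is just glue code.

```python
from abc import ABC, abstractmethod


class SpeechToTextEngine(ABC):
    """Hypothetical engine-agnostic STT interface (illustrative only)."""

    @abstractmethod
    def transcribe(self, audio: bytes, sample_rate: int) -> str:
        """Return the transcript for raw 16-bit PCM audio."""


class DummyEngine(SpeechToTextEngine):
    """Stand-in backend; a real one would wrap DeepSpeech, Kaldi, etc."""

    def transcribe(self, audio: bytes, sample_rate: int) -> str:
        # A real backend would run acoustic and language models here.
        return "hello world"


def run_pipeline(engine: SpeechToTextEngine, audio: bytes) -> str:
    # Application code depends only on the interface, so replacing the
    # backend requires no changes here -- only a new SpeechToTextEngine
    # subclass with the glue logic for the new system.
    return engine.transcribe(audio, sample_rate=16000)


print(run_pipeline(DummyEngine(), b"\x00\x00" * 16000))  # prints "hello world"
```

With such a seam in place, a better method appearing in the research literature would mean writing one new subclass, not reworking the whole engine.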

I hope this clarifies the questions. These questions will always come up as a comparative analysis between DeepSpeech, Kaldi and Wav2Letter, because organizations want to know which way to go. As a matter of fact, the blog post says this:

There are only a few commercial quality speech recognition services available, dominated by a small number of large companies. This reduces user choice and available features for startups, researchers or even larger companies that want to speech-enable their products and services.

Again, I am very interested in DeepSpeech, so I am very happy to help if anything is not documented.

Can you please tell me what should be added, and where?

Well, I’m sorry, but the only source of truth here is native_client/ and native_client/deepspeech.h. We already have documentation of the API.

I really have no idea how I would be able to comment on that, I’m not the whole Mozilla org by myself.

There’s no hidden plan besides the one you quoted. Using DeepSpeech as a WebSpeech API implementation does fit the use case, just as it does for a lot of other uses in other Mozilla products.

Not designing DeepSpeech solely for the purpose of the WebSpeech API gives us more flexibility in what we do and how we do it, and avoids unnecessary coupling.

Thank you very much for your answers, especially for pointing out those source files. I think I need to dig deep into the code to find the finer details.

Basically, I am confused by this,

as I feel the algorithm and the application should be separate. These pages detail the algorithms quite nicely, but not the application architecture: how the RNN system fits into the engine, and how input, output etc. are handled. My query is purely from a software engineering perspective rather than about the algorithms.

However, what you said is good enough for me for now.

I will dig deep and let you know. Maybe we can add some application details to the documentation for people like me.

Well, the source code is the best documentation. These are not really details that are meaningful to end users, and we assume that people wanting to know the internals will have a look at the source code and ask detailed questions if they have any.

I’m not sure what you mean exactly here. We have separate training and inference code.

There’s no “application”, just a library with several bindings. We never got any request for extensive documentation of the library’s internals, and I’m unsure it’s useful for the vast majority of end users. Those interested can just have a look at the code and, as I said, ask if they have questions.

It’s not like we are building some proprietary software where the published documentation would be the only source of truth; you can read, use and hack the code to explore it yourself as well.

Well, I still fail to understand what those “application details” would be.

Just a clarification:
CMUSphinx and Kaldi are not end-to-end ASR, because they require a pronunciation lexicon.
Obviously, a lexicon-free ASR is ideally better.
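The distinction can be made concrete with a toy sketch (the lexicon entries and function names below are illustrative, not taken from any of these projects): a lexicon-based recognizer can only hypothesize words present in its pronunciation dictionary, while an end-to-end character-level model emits characters directly and can therefore spell out-of-vocabulary words.

```python
# Toy pronunciation lexicon: word -> phoneme sequence (illustrative entries).
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}


def lexicon_decode(word):
    """A lexicon-based system can only hypothesize in-vocabulary words."""
    return LEXICON.get(word)  # None for out-of-vocabulary words


def char_decode(word):
    """An end-to-end character-level model emits characters directly,
    so it can produce any spelling, even for words never listed anywhere."""
    return list(word)


print(lexicon_decode("mozilla"))  # None: out of vocabulary for the lexicon
print(char_decode("mozilla"))     # ['m', 'o', 'z', 'i', 'l', 'l', 'a']
```

This is why adding new vocabulary to a lexicon-based system means editing the dictionary, whereas a character-level end-to-end system handles it for free (accuracy aside).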