DeepSpeech accuracy decreasing?

In the latest v0.2.0 release, it is mentioned that the word error rate is 11% on the LibriSpeech clean test corpus.

“… which achieves an 11% word error rate on the LibriSpeech clean test corpus”

However, in November 2017, this article stated that the word error rate was 6.5%.

“Our word error rate on LibriSpeech’s test-clean set is 6.5%”

Of course it could be that the current version is faster/lighter and a trade-off was made. Another difference I could think of is that less training data was used. However, from the same release notes, the usual suspects are there. Would it be possible to elaborate on this difference?

There were a few factors that were at play:

  1. The largest factor was that the acoustic model went from being a BRNN to an RNN of roughly half the size. This increased the WER but paved the way for VAD-mediated streaming, so a trade-off was made (see the sketch just after this list).
  2. We also found that some of the training data for our old language model contained sentences from the LibriSpeech clean test set, so we removed them and created a clean language model that does not contain these sentences.
  3. In addition, our training data now includes Common Voice, which is a noisy data set. It was recorded with laptop and smartphone microphones in environments that were not always quiet. This is a different “profile” than the LibriSpeech clean data set, so our model is now more tuned to noisy data and does not perform as well on LibriSpeech clean.
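To make the streaming point concrete, here is a minimal sketch of why a bidirectional layer can’t stream while a unidirectional one can. This is illustrative PyTorch, not our actual TensorFlow code, and the layer sizes are made up:

```python
import torch
import torch.nn as nn

# A bidirectional layer must see the whole utterance before emitting any
# output, because its backward direction runs from the last frame to the
# first.
bidirectional = nn.LSTM(input_size=26, hidden_size=1024, bidirectional=True)

# A unidirectional layer depends only on past frames, so it can emit
# outputs as audio arrives, which is what makes streaming possible.
unidirectional = nn.LSTM(input_size=26, hidden_size=1024)

frames = torch.randn(100, 1, 26)  # (time, batch, features), dummy features

# Streaming with the unidirectional layer: feed one frame at a time,
# carrying the hidden state forward between calls.
state = None
for t in range(frames.size(0)):
    out, state = unidirectional(frames[t:t + 1], state)

# The bidirectional layer has no such incremental mode; it needs all
# 100 frames up front.
out_all, _ = bidirectional(frames)
```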

Thank you @kdavis, very clear.

Was there also an attempt to test on the Common Voice test data? That should give a better indication of how well the model performs on noisy data. Presumably the WER would be higher than on the LibriSpeech data set.

In our internal tests we got

  • 11.2% WER on LibriSpeech clean test
  • 28.7% WER on LibriSpeech other test
  • 15.6% WER on Common Voice test

which compares reasonably well with, say, Google’s results from the GitHub repo:

  • 12.1% WER on LibriSpeech clean test
  • 28.8% WER on LibriSpeech other test
  • 23.2% WER on Common Voice test
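For reference, the WER numbers above are the word-level Levenshtein (edit) distance between the reference transcript and the hypothesis, divided by the number of words in the reference. A minimal self-contained sketch of the metric:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by
    the number of words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# Two substitutions in a four-word reference: WER 0.5
print(wer("the clean test corpus", "a clean test corps"))
```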

Would it be too difficult to maintain both kinds of acoustic model, and offer the option of choosing which to use depending on whether streaming or accuracy is more important for a given use case?

Unfortunately, we don’t have the resources to maintain both. However, you can always use/fork the old 0.1.1 version, which has a BRNN.

Fair enough. Thanks for the explanations, and the great open project!

@kdavis I have been experimenting with the v2.0.0 release of DeepSpeech along with the new model that was released. I can see that there are errors with some basic words that worked fine with the 0.1.1 model. A few examples are shown below:

1. “pasta” was interpreted as “tosta”
2. “nearby” was interpreted as “merby”
3. “more information” was interpreted as “mornin formation”

These were repeated multiple times with an American-accented speaker, but the results were the same, which was not the case with the 0.1.1 model. Do you see these as valid issues with the new model? If so, will training with a LibriSpeech clean dataset to create a new model correct these errors?

Just a note, the release is v0.2.0, not v2.0.0, a big difference 🙂

As to which words and phrases work with v0.1.1 vs v0.2.0, with which background noise, with which microphones, and with which up/down sampling: unfortunately, we have little control over that.

What I’d suggest is to create a data set of the words and phrases you expect, with background noise from your use case, with the specific microphones you expect to use, and with the up/down sampling in your processing chain. Then take this data and fine-tune the model we provide to your use case.
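For example, here is a hedged sketch of preparing such a data set in the CSV format the training script expects (wav_filename, wav_filesize, transcript). The directory layout and the transcripts.txt format are hypothetical:

```python
import csv
import os

# Hypothetical layout: my_recordings/ holds 16 kHz mono WAV files captured
# with your target microphones and background noise; transcripts.txt maps
# each file name to its transcript, tab separated.
rows = []
with open("transcripts.txt") as f:
    for line in f:
        name, transcript = line.rstrip("\n").split("\t")
        path = os.path.join("my_recordings", name)
        # The default English alphabet is lowercase, so normalize case.
        rows.append((path, os.path.getsize(path), transcript.lower()))

# DeepSpeech's training CSVs use these three columns.
with open("fine_tune_train.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["wav_filename", "wav_filesize", "transcript"])
    writer.writerows(rows)
```

You can then point DeepSpeech.py at this CSV and at the released checkpoint (roughly --train_files fine_tune_train.csv --checkpoint_dir with the released checkpoint; check the README of your release for the exact flags).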

In addition, you can create a language model and trie that are tuned to your use case too.
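For instance, the language model side can be built with KenLM (https://github.com/kpu/kenlm). A rough sketch, assuming the KenLM binaries are on your PATH; corpus.txt is a hypothetical file with one normalized sentence per line drawn from your use case:

```python
import subprocess

# Build a 5-gram ARPA model from your domain text, then convert it to
# KenLM's binary format, which is what DeepSpeech loads as lm.binary.
subprocess.run(["lmplz", "--order", "5", "--text", "corpus.txt",
                "--arpa", "lm.arpa"], check=True)
subprocess.run(["build_binary", "lm.arpa", "lm.binary"], check=True)
```

The trie is then built with the generate_trie tool shipped in DeepSpeech’s native_client; its exact arguments vary by release, so check the tool’s usage output for your version.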