Terrible Accuracy?

dmcallister · October 14, 2019, 7:25pm

Hey guys, I used a pre-trained version of the model, specifically https://github.com/mozilla/DeepSpeech/releases/download/v0.5.1/deepspeech-0.5.1-models.tar.gz. I have said “hello” about 15 times and each time the prediction is wildly off. Is there something I missed when reading about the pre-trained models? Obviously the other solution is to start training a fresh model. The blog post I was reading seemed to suggest that this pre-trained model was strong enough to handle common words, etc.

Any advice?

dabinat · October 15, 2019, 4:17am

I think I know the blog post you’re referring to and that number is a benchmark of very clean audio, not necessarily an indicator of real-world results.

DeepSpeech’s models are still in development and don’t have the quantity of data that a production model should have.

In particular (and I suspect this was the issue in your case), the pre-built models are not very robust to noise. This should improve over time as the model gains more data and also with DeepSpeech features like augmentation (coming in 0.6).

dabinat · October 15, 2019, 4:19am

But you don’t need to train a model completely from scratch. You can continue training the checkpoints with your own data.

lissyx · October 15, 2019, 10:58am

Can you elaborate on your testing process? Even if @dabinat is right, the model is able to give good enough accuracy with my poor English accent. Most of our English training data, for now, is American-accented, so this also adds some bias.

dmcallister · October 15, 2019, 1:28pm

Hi there, I am a native English speaker. In terms of output from the model, I say “Hello” as clearly as I can and the output this time is “right el you hela her”. I am also getting some warnings, which I’ll post below.

* recording
* done recording
TensorFlow: v1.13.1-10-g3e0cc5374d
DeepSpeech: v0.5.1-0-g4b29b78
2019-10-15 09:27:17.689491: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-10-15 09:27:17.702351: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: UnwrapDatasetVariant
2019-10-15 09:27:17.702375: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "CPU"') for unknown op: UnwrapDatasetVariant
2019-10-15 09:27:17.702386: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: WrapDatasetVariant
2019-10-15 09:27:17.702395: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "CPU"') for unknown op: WrapDatasetVariant

lissyx · October 15, 2019, 1:31pm

That does not give any information about the accent.

That does not give any context on how you perform your recording or how you run inference.

dmcallister · October 15, 2019, 1:37pm

Training from the checkpoint with the Mozilla dataset, wouldn’t that cause overfitting? (I am assuming this is how they trained the current model.)

dabinat · October 15, 2019, 3:33pm

The 0.5.1 model doesn’t include Common Voice data. But it’s only about 130 hrs of English so you’d still need additional data.

safas · October 16, 2019, 2:05pm

An easy way to get more accuracy is to keep using the pre-trained acoustic model, as you are, but provide a custom language model.
The vocabulary used in 0.5.1 contains many, many combinations which, to me, look like 1800s English, so it’s best to separate the two. See:

and

cormorano.0771 · October 21, 2019, 1:58pm

Are you sure you are using monophonic audio sampled at 16000 Hz?
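A quick way to check this is to read the WAV header before running inference. The snippet below is only a sketch using Python's standard `wave` module; `check_wav` is a hypothetical helper name, and the 16-bit sample width is an assumption on my side rather than something stated in this thread.

```python
import wave

EXPECTED_RATE_HZ = 16000    # sample rate asked about above
EXPECTED_CHANNELS = 1       # monophonic
EXPECTED_SAMPLE_WIDTH = 2   # bytes per sample, i.e. 16-bit PCM (assumption)


def check_wav(path):
    """Return a list of format problems found in the WAV header (empty if none)."""
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        width = wav.getsampwidth()
    if channels != EXPECTED_CHANNELS:
        problems.append(f"expected {EXPECTED_CHANNELS} channel, found {channels}")
    if rate != EXPECTED_RATE_HZ:
        problems.append(f"expected {EXPECTED_RATE_HZ} Hz, found {rate} Hz")
    if width != EXPECTED_SAMPLE_WIDTH:
        problems.append(f"expected {EXPECTED_SAMPLE_WIDTH * 8}-bit samples, found {width * 8}-bit")
    return problems
```

Calling `check_wav("hello.wav")` on your recording (the file name is just a placeholder) returns an empty list when the header matches, and otherwise a list of what differs. It only inspects the header, so it says nothing about noise or clipping in the audio itself.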

beiserjohannes · October 28, 2019, 2:13pm

Do you have an update for us? I’m having the same trouble with inaccurate results (not as bad as yours, but “hello world” often results in something that sounds similar but isn’t accurate at all, like “hello willed”, “hello old” or “allow for”).

dmcallister · October 28, 2019, 6:35pm

I wish I had a better update for you. After looking at the model and our use case, it was easier for us to implement a cloud solution for the small project we were building, which was disappointing. There just isn’t enough support/data for us to train something like DeepSpeech to work comparably to something like GCP services. I am planning on taking another look at this in the future (in 6 months or so); hopefully I’ll have some more information for everyone.

safas · October 28, 2019, 11:36pm

If your sentences are somewhat limited, e.g. 100k, providing your own lm.binary will improve things immensely. It’s still a mystery to me why the acoustic model and the language model are generated from the same corpora.

reuben · October 29, 2019, 10:50am

They aren’t.

kdavis · October 29, 2019, 10:54am

What makes you think this is the case? It is not.

The acoustic model and language model are generated from different corpora.

beiserjohannes · October 29, 2019, 12:37pm

I wonder why the pre-trained model with the lm.binary + trie they provide returns such inaccurate results. If I create my own lm.binary with just a handful of words or sentences it works wonderfully (like here), but only for those sentences/words. If I replace that LM with the one they provide, the results make no sense again. (The individual words make sense, but not in relation to each other, even though the LM and trie are provided.)

I wonder if accuracy would improve with an acoustic model trained on the Common Voice dataset plus a different language model. Does something like this already exist as open source?

Or am I missing something and this should work fine?

kdavis · October 29, 2019, 1:36pm

I’d guess this is the case, as the WER of the 0.5.1 release model on LibriSpeech clean is 8.2%.

reuben · October 29, 2019, 1:42pm

Given how easy it is to build a language model, I’d strongly recommend anyone who has access to a text corpus that matches their intended use case to use a custom LM.

Our LM is created from a corpus [0] that will not necessarily match your use case.
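As a rough illustration of the first step, here is a sketch that turns a list of in-domain sentences into a one-sentence-per-line text file. The helper names and the output file name are hypothetical, and the character set (lowercase a-z, apostrophe, space) is an assumption about the alphabet the English model uses, so check it against the alphabet.txt you actually load.

```python
import re

# Assumed alphabet: lowercase a-z, apostrophe and space. Everything else is
# replaced by a space so the LM never sees characters the model cannot emit.
NOT_ALLOWED = re.compile(r"[^a-z' ]+")


def normalize_sentence(text):
    """Lowercase, map curly apostrophes to straight ones, drop other symbols."""
    text = text.lower().replace("\u2019", "'")
    text = NOT_ALLOWED.sub(" ", text)
    return " ".join(text.split())  # collapse runs of whitespace


def build_corpus(sentences):
    """Normalize every sentence and drop the ones that end up empty."""
    normalized = (normalize_sentence(s) for s in sentences)
    return [line for line in normalized if line]


def write_corpus(sentences, path="corpus.txt"):
    """Write one normalized sentence per line (hypothetical output path)."""
    with open(path, "w", encoding="utf-8") as out:
        for line in build_corpus(sentences):
            out.write(line + "\n")
```

For example, `build_corpus(["Turn ON the lights.", "123"])` gives `["turn on the lights"]`. The resulting text file is the kind of input the KenLM tools (`lmplz`, then `build_binary`) work from to produce an lm.binary; the exact options depend on your release, so take them from the scripts shipped with the version you use rather than from this sketch.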

safas · October 29, 2019, 6:38pm

It’s good that I’m wrong. Perhaps the "how i trained my own french …" guide does not distinguish between the two, and that caused the confusion.
In any case, I checked the words.arpa of the 0.5.1 lm.binary and it contains really strange sentences from the 1800s and not-so-common words.
It would be good to emphasize that building your own lm.binary per use case would improve things.

shamoons · October 29, 2019, 7:53pm

I’m also using the 0.5.1 model against the dev-clean LibriSpeech set, but I’m getting an average WER of 18%, which seems high.
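For anyone comparing numbers, it may help to pin down how the WER is computed. Below is a minimal sketch of word-level WER (edit distance over words divided by the number of reference words), with two ways of aggregating over a test set. The function names are my own, references are assumed to be non-empty, and I am not claiming this is how the 8.2% figure was produced or that it explains the gap.

```python
def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def average_wer(pairs):
    """Mean of the per-utterance WERs (short utterances weigh as much as long ones)."""
    rates = [errors / words for errors, words in
             (word_errors(ref, hyp) for ref, hyp in pairs)]
    return sum(rates) / len(rates)


def corpus_wer(pairs):
    """Total word errors divided by total reference words."""
    counts = [word_errors(ref, hyp) for ref, hyp in pairs]
    return sum(e for e, _ in counts) / sum(n for _, n in counts)
```

Using two transcripts quoted earlier in this thread, `pairs = [("hello world", "hello willed"), ("hello", "right el you hela her")]` gives `average_wer(pairs) == 2.75` but `corpus_wer(pairs) == 2.0`, so the two aggregations can differ noticeably on the same output. Text normalization (case, punctuation) before scoring also changes the result.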
