Terrible Accuracy?

What makes you think this is the case? It is not.

The acoustic model and language model are generated from different corpora.

I wonder why the pre-trained model with the lm.binary + trie they provide returns such inaccurate results. If I create my own lm.binary with just a handful of words or sentences, it works wonderfully (like here), but only for those sentences/words. If I replace that LM with the one they provide, the results make no sense again (the words make sense, but not in relation to each other, even though the provided lm & trie are used).

I wonder if accuracy would improve with an acoustic model trained on the Common Voice dataset plus a different language model. Does something like this already exist as open source?

Or am I missing something and this should work fine?

I'd guess this is the case, as the WER of the 0.5.1 release model on LibriSpeech clean is 8.2%.

Given how easy it is to build a language model, I'd strongly recommend anyone who has access to a text corpus that matches their intended use case to use a custom LM.

Our LM is created from a corpus [0] that will not necessarily match your use case.
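For reference, the rough workflow for building a custom lm.binary and trie looks something like the sketch below. The file names, the n-gram order, and the generate_trie argument order are assumptions on my part and can differ between releases, so treat it as a sketch rather than the official procedure.

    # Sketch: build a custom lm.binary and trie from a plain-text corpus.
    # Assumes KenLM's lmplz/build_binary and DeepSpeech's generate_trie are on PATH;
    # file names, n-gram order and generate_trie arguments are assumptions and may
    # differ between releases.
    import subprocess

    CORPUS = "corpus.txt"      # one normalized sentence per line, matching your use case
    ARPA = "lm.arpa"
    BINARY = "lm.binary"
    ALPHABET = "alphabet.txt"  # same alphabet as the acoustic model
    TRIE = "trie"

    # 1. Estimate an n-gram LM from the corpus.
    with open(CORPUS, "rb") as fin, open(ARPA, "wb") as fout:
        subprocess.run(["lmplz", "-o", "5"], stdin=fin, stdout=fout, check=True)

    # 2. Convert the ARPA file to KenLM's binary format.
    subprocess.run(["build_binary", ARPA, BINARY], check=True)

    # 3. Build the trie used by the decoder.
    subprocess.run(["generate_trie", ALPHABET, BINARY, TRIE], check=True)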

It's good that I'm wrong. Perhaps the "How I trained my own French …" guide does not distinguish between the two, and that caused the confusion.
In any case, I checked the words.arpa behind the 0.5.1 lm.binary and it contains really strange sentences from the 1800s and not-so-common words.
It would be good to emphasize that building your own lm.binary per use case would improve things.

I'm also using the 0.5.1 model against the LibriSpeech dev-clean set, but getting an average WER of 18%, which seems high.

Please keep in mind this is an old, contributed tutorial; a lot has moved on since. I don't want to dismiss @elpimous_robot's contribution, it is great :slight_smile:

How do you check that?

Which is not surprising, since LibriSpeech is based on old books.


I'm testing with LibriSpeech dev-clean, so it's the same old books. To calculate WER, I'm using jiwer.

For each sample I'm tracking the ground truth, the transcript, and a clean_wer value computed with jiwer, then averaging the clean_wer over all samples (roughly as in the sketch below).
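A minimal sketch of that per-sample tracking (not the exact code; the toy pairs stand in for the dev-clean reference/transcript pairs):

    import jiwer

    # (reference, hypothesis) pairs; in practice these come from the dev-clean
    # transcripts and the model's output for each audio file.
    pairs = [
        ("the quick brown fox", "the quick brown fox"),
        ("jumps over the lazy dog", "jumps over a lazy dog"),
    ]

    samples = []
    for ground_truth, transcript in pairs:
        samples.append({
            "ground_truth": ground_truth,
            "transcript": transcript,
            "clean_wer": jiwer.wer(ground_truth, transcript),  # per-sample WER
        })

    # then average clean_wer over all samples
    average_wer = sum(s["clean_wer"] for s in samples) / len(samples)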

They are using a different method for evaluation. Ours is consistent with others, but I don't remember the specifics. Maybe @reuben remembers?

Lissyx, thanks my friend 😉

So it seems I'm simply calculating WER differently - is that right? https://github.com/mozilla/DeepSpeech/blob/daa6167829e7eee45f22ef21f81b24d36b664f7a/util/evaluate_tools.py#L19 seems to have a function to evaluate. But is there some clean interface?

That's about right, you can also look at how it is used in evaluate.py. Regarding a clean interface, it's not really meant to be exposed, so I don't think we can guarantee that …
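To make the difference concrete, here is a small sketch contrasting the two aggregations: averaging per-sample WER versus dividing the total word-level edit distance by the total number of reference words (a corpus-level WER, which I believe is closer to what our report does, but please double-check against evaluate_tools.py). The two numbers can diverge noticeably when short utterances score badly.

    # Sketch: two ways to aggregate WER over a set of samples.
    # Assumes the editdistance package; the toy data is only for illustration.
    import editdistance

    pairs = [
        ("hello world", "hello word"),  # 1 error over 2 words
        ("a much longer reference sentence here", "a much longer reference sentence here"),
    ]

    per_sample = []
    total_errors = 0
    total_words = 0
    for ref, hyp in pairs:
        ref_words, hyp_words = ref.split(), hyp.split()
        errors = editdistance.eval(ref_words, hyp_words)  # word-level Levenshtein
        per_sample.append(errors / len(ref_words))
        total_errors += errors
        total_words += len(ref_words)

    macro_wer = sum(per_sample) / len(per_sample)  # average of per-sample WERs: 0.25
    micro_wer = total_errors / total_words         # corpus-level WER: 0.125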

The only thing that would explain the inaccuracy would be my German accent. I have an easy-to-set-up example project here which uses Angular & Node.js to record and transcribe audio. It would help me a great deal if you could see for yourself and confirm/deny my experience with the accuracy.

Well, that's not a small difference. As documented, the current pre-trained model mostly covers American English accents, so it's expected to be of lower quality with other accents.

FTR, being French, I'm also suffering from that …


Around 10,000 hours of speech data is required to create a high-quality STT model; the current model has a fraction of this. It is also not very robust to noise.

These issues will be solved over time with more data, but the current model should not be considered production-ready.

The model does achieve a <10% WER on the LibriSpeech clean test set - the key word there being "clean". It is not a test of noisy environments or accent diversity.


I am currently using the dev-clean set, so I should have similar results. As for measuring WER, I am now doing:

    # requires the editdistance package (import editdistance at module level)
    def word_error_rate(self, ground_truth, hypothesis):
        ground_truth_words = ground_truth.split(' ')
        hypothesis_words = hypothesis.split(' ')
        # word-level Levenshtein distance between the two token lists
        levenshtein_word_distance = editdistance.eval(ground_truth_words, hypothesis_words)
        # WER = word-level edit distance / number of reference words
        wer = levenshtein_word_distance / len(ground_truth_words)
        return wer

Here editdistance computes a word-level Levenshtein distance. I am now getting an average WER of ~17%. What am I doing wrong?

Is there anything in particular that you would point out as a change? I also started off with that tutorial, so I'm wondering what I might need to revise.

Sorry, I have no time to review that.

Fair. The only thing I can think of is that some of the hyperparameters he suggests might be out of date, but apart from that I can't see anything that stands out.

See util/evaluate_tools.py, in particular calculate_report.