DeepSpeech vs Picovoice Cheetah

Hey everyone,

I am currently preparing my master's thesis on the importance of privacy when using chatbots. Basically, I am going to build two chatbots, one based on Google's Dialogflow and one using an offline tech stack.

At the beginning of my research I was sure that I was going to use DeepSpeech together with the pretrained model for speech-to-text, but I am getting an error rate of about 50%.

For testing, I recorded a WAV file in Audacity (mono, 16 kHz sampling) and sent it through DeepSpeech following the tutorial on GitHub.
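Roughly what I did, as a sketch using the v0.5.1 Python bindings (the paths are placeholders for my local copies of the released model files):

```python
import wave

import numpy as np
from deepspeech import Model

# Decoder defaults from the v0.5.1 native client
N_FEATURES = 26
N_CONTEXT = 9
BEAM_WIDTH = 500
LM_ALPHA = 0.75
LM_BETA = 1.85

# Placeholder paths to the released v0.5.1 model files
ds = Model('models/output_graph.pbmm', N_FEATURES, N_CONTEXT,
           'models/alphabet.txt', BEAM_WIDTH)
ds.enableDecoderWithLM('models/alphabet.txt', 'models/lm.binary',
                       'models/trie', LM_ALPHA, LM_BETA)

# Load the mono 16 kHz WAV recorded in Audacity
with wave.open('test.wav', 'rb') as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    rate = w.getframerate()

print(ds.stt(audio, rate))
```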

I then used Picovoice Cheetah to transcribe the same WAV and got an error rate of about 20%. Google's Cloud Speech-to-Text got every single word correct.

My question is: is it possible I did something wrong, or does DeepSpeech have a problem with German-accented English? I am really wondering, since Picovoice states on their webpage that their model has a higher error rate than DeepSpeech.

Greetings

Max

What version of DeepSpeech are you using? And yes, DeepSpeech has trouble with non-American accents because almost all of our training data is American English.

@reuben
I am using 0.5.1. I followed the tutorial from the start page (but got an error that the --alphabet flag was missing, so I added it with the path to the model's alphabet file).

OK. I'm currently experimenting with different language models and would be curious to know whether they improve your results. If you don't mind doing a bit of beta testing, could you try this LM and trie combo instead of the ones we shipped with our v0.5.1 package?

LM: https://drive.google.com/file/d/13sC76Ih8SyBaaMVD-EEvM2CNGJp1y-An/view?usp=sharing
Trie: https://drive.google.com/file/d/1QmEb1YrE9EC-iyM6O0XaUd7N1MDeAj1b/view?usp=sharing

These were built from a more contemporary text source (https://skylion007.github.io/OpenWebTextCorpus/). Could you try them and let me know how they change things for you?
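If you're on the Python package, swapping them in is just a matter of pointing the decoder at the downloaded files; a minimal sketch with placeholder paths:

```python
from deepspeech import Model

# Same released v0.5.1 acoustic model, decoded with the new LM/trie.
# Paths below are placeholders for wherever you saved the downloads.
ds = Model('models/output_graph.pbmm', 26, 9, 'models/alphabet.txt', 500)
ds.enableDecoderWithLM('models/alphabet.txt',
                       'downloads/lm.binary',  # new LM from the link above
                       'downloads/trie',       # new trie from the link above
                       0.75, 1.85)             # default LM alpha/beta
```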

That’s probably because you were reading the docs from master, where the alphabet flag was removed. If you’re using v0.5.1 you need to look at the docs for the corresponding version: https://github.com/mozilla/DeepSpeech/tree/v0.5.1#project-deepspeech

@reuben

Awesome! Thanks for your help. I am going to try the different language models tomorrow and let you know how they work for my use case.

@reuben

I recorded a test file with the following text:
Hey computer, do you understand me? Or is my German accent to strong for you? If not I am happy talking to you. Anyway can you help me order a beer? Or tell me a nice restaurant nearby. And most important can you do the next Ubuntu software update yourself.

Your model produced:
he counter understand me was my turn accent to strong for you if not in a tent talking to you in a way can you help me to order our gear or tell me a nice nice pastel nearby the most important can you do the next we can into our day to house

DeepSpeech’s default model:
he counter orders and me was my turn accent to strong for you if not i may paintaking to you in a way can you help me to order our gear or tell me a necissity in most important can you do the next oconto saturday the house

Cheetah:
COULD YOU TURN ON THE STAND ME AS MUCH AN ACCENT TO STORE FIGURE IF NOT I MAY BE TALKING TO YOU WE CAN YOU HELP ME TO ORDER TO THE RIGHT ELLEN AS MY SISTER ONLY BY AND MOST IMPORTANT CAN YOU DO THE NEXT WHO WANT TO SET A DATE THE SELF

Google:
Hey Computer. Do you understand me was my German accent too strong for you. If not, I may be talking to you. Anyway, can you help me to order a beer or tell me a nice restaurant nearby and most important. Can you do the next Boon to software update yourself?

One thing I am kind of wondering about: with shorter sentences, Cheetah worked better.
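(By "error rate" I mean word error rate: the word-level edit distance divided by the number of reference words. A minimal sketch of the computation:)

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))  # first row of the DP matrix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1] / len(ref)

# e.g. 4 errors over 6 reference words -> ~0.67
print(wer("hey computer do you understand me", "he counter understand me"))
```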

Interesting… even though DeepSpeech didn’t get close to what the transcript actually was, the newer LM does at least produce words that are actually in the dictionary, unlike “oconto” and “necissity” in the old LM. So it looks like the new LM is an improvement.

@reuben I used the language model linked above, but the inference time is greatly increased.

Then I started to look at the OpenWebText corpus.

What I am curious about is this: the OpenWebText corpus is 20 GB in size and the lm.binary you get from it is 1.8 GB, yet the LibriSpeech text corpus, which is itself only 1.8 GB, also yields an lm.binary of 1.8 GB (the one that is part of the current DeepSpeech repository). Could you explain how text corpora of different sizes end up producing lm.binary files of the same size?

OpenWebText is about 37GB in size uncompressed, not 20GB. The LibriSpeech LM corpus is about 4.2GB uncompressed if I recall correctly. The LM shared in the link above was created using only a sample of OpenWebText, and it was then also pruned. The file name indicates the various KenLM options used in the process of generating the LM file. The final LM binary file size is affected by several things, not just text corpus size.
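If you want to see the effect of those options yourself, the main size knobs are pruning at estimation time and quantization when building the binary; a rough sketch, assuming KenLM's lmplz and build_binary are on your PATH (file names are placeholders):

```python
import subprocess

# Estimate a 5-gram model, pruning rare higher-order n-grams
# (one threshold per order; the last value extends to higher orders)
subprocess.run(['lmplz', '--order', '5', '--prune', '0', '0', '0', '1',
                '--text', 'corpus.txt', '--arpa', 'lm.arpa'], check=True)

# Pack it into a binary trie with 8-bit quantized probabilities (-q 8)
# and pointer compression (-a 255); both cut the file size substantially
subprocess.run(['build_binary', '-a', '255', '-q', '8', 'trie',
                'lm.arpa', 'lm.binary'], check=True)
```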