Deep Speech v0.4.1 Released

carlfm01 · January 12, 2019, 7:31am

Hi @kdavis, I just did a WER test for the Windows client, here’s the result:

Stable RAM usage of 1.7 GB.

The test took about 3 hours on a virtual Intel Xeon Platinum 8168 @ 2.7 GHz with 16 vCores.

WER of 8.87% with the LM enabled.

You can see the tool that I wrote here

I noticed a considerable number of errors related to apostrophes, for example “i’m” versus “i am”. Is this expected? Yes, the WER increases, but in the end the meaning is the same.
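To make the effect concrete, here is a minimal sketch of a word-level WER computation (Levenshtein distance over words, divided by the reference length). It is only an illustration: the function name `word_error_rate` is made up for this example, and it is not necessarily how the tool mentioned above or DeepSpeech itself computes WER.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,          # deletion
                          curr[j - 1] + 1,      # insertion
                          prev[j - 1] + cost)   # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# "i am" recognized as "i'm" costs one substitution plus one deletion.
print(word_error_rate("i am going home", "i'm going home"))  # 0.5
```

A single contraction mismatch in a four-word sentence already yields 50% WER for that sentence, even though the meaning is unchanged.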

At the moment I can’t build for CUDA; hopefully I’ll get access to a CUDA device soon.

kdavis · January 13, 2019, 5:13pm

Cool! Nice having a second pair of eyes on the WER.

We had a slightly lower value, 8.26%, but it seems to be basically the same.

As to the problems with apostrophes, yes we’ve noted the same. I’d guess it’s a hard problem to solve as when spoken quickly it can sometimes be unclear if a person said “i am” or “i’m”.

If you have any ideas on how one could solve it, we’re “all ears.”

carlfm01 · January 13, 2019, 6:53pm

Well, not at the moment.

What about this one, “perform’d”? There are a few with ’d.

I’m still collecting Spanish from LibriVox, so I’m not experienced with creating the LM. If I come up with something that can improve the apostrophe issue, of course I’ll share it.

kdavis · January 14, 2019, 8:32am

“perform’d”, interesting. I wasn’t aware that this was a word until I looked it up just now. We build our language model from SLR11; I wonder if there is some prevalence of “perform’d” there?

kdavis · January 14, 2019, 11:40am

I just grep’ed the SLR11 text and there are 175 lines that contain “perform’d”. So that’s the source of the “perform’d” problem.
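For anyone who wants to repeat this kind of count without grep, here is a rough Python sketch. The function names are invented for this example, and it assumes the corpus is a gzip-compressed text file with one sentence per line. Matching is case-insensitive because the case of the corpus text is not assumed, and it counts substring hits the way a plain grep would.

```python
import gzip
from typing import Iterable


def count_lines_containing(lines: Iterable[str], needle: str) -> int:
    """Count the lines that contain `needle`, ignoring case."""
    needle = needle.lower()
    return sum(1 for line in lines if needle in line.lower())


def count_in_gzip(path: str, needle: str) -> int:
    """Stream a gzip-compressed text file line by line and count matches."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_lines_containing(handle, needle)


sample = ["HE PERFORM'D WELL", "nothing here", "perform'd twice perform'd"]
print(count_lines_containing(sample, "perform'd"))  # 2
```

Calling `count_in_gzip` with the path to a local copy of the corpus would stream the file instead of loading it into memory.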

carlfm01 · January 14, 2019, 7:19pm

Here’s a list of words with the ’d issue:

million’d
emerg’d
poison’d
impress’d
pierc’d
remov’d
rebuk’d
steel’d

I’d better share the result; I’m not a native speaker, so I may be missing a couple more.

I ran the WER test again and noticed that a few of them also appear in the LibriSpeech clean test corpus.
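Words like the ones listed above could also be collected mechanically instead of by eye. The following is a sketch only: the regular expression, the small exclusion set of ordinary contractions, and the name `elided_forms` are all assumptions made for this example, and the exclusion set is certainly incomplete.

```python
import re
from collections import Counter
from typing import Iterable

# Letters, then an ASCII or typographic apostrophe, then a final "d".
ELIDED_D = re.compile(r"\b[a-z]+['’]d\b", re.IGNORECASE)

# Ordinary modern contractions that should not be flagged (assumed, incomplete).
MODERN = {"i'd", "you'd", "he'd", "she'd", "it'd", "we'd", "they'd", "who'd"}


def elided_forms(lines: Iterable[str]) -> Counter:
    """Count words ending in 'd, skipping common modern contractions."""
    counts = Counter()
    for line in lines:
        for match in ELIDED_D.findall(line):
            # Normalize case and apostrophe style so variants are merged.
            word = match.lower().replace("’", "'")
            if word not in MODERN:
                counts[word] += 1
    return counts


sample = [
    "the moony flood dimm’d the vermillion’d woodbine’s scarlet bud",
    "I'D SAY HE PERFORM'D WELL",
]
print(elided_forms(sample))
```

Sorting the resulting counter by frequency would give a candidate list for a native speaker to review.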

carlfm01 · January 14, 2019, 7:24pm

Here’s the WER result: https://pastebin.com/1Wrp3pVH

I don’t know if boy's, infant's and thee's are correct.

dabinat · January 14, 2019, 10:55pm

When I validate I pay special attention to small things like whether the person said “I’m” or “I am” but I have no idea if other validators do. I think if you’re clicking through quickly you may miss stuff like that. So there may perhaps be incorrectly transcribed clips in the dataset contributing to this.

kdavis · January 15, 2019, 1:17pm

I haven’t looked for all the strings you mention, but I’ve found examples of all the ones I’ve searched for in SLR11. For example…

beyond the green within its western close a little vine hung leafy arbor rose where the pale lustre of the moony flood dimm’d the vermillion’d woodbine’s scarlet bud and glancing through the foliage fluttering round in tiny circles gemm’d the freckled ground
…
amidst them next a light of so clear amplitude emerg’d that winter’s month were but a single day were such a crystal in the cancer’s sign
…
a cleric lately poison’d his own mother and being brought before the courts of the church they but degraded him
…

It seems this is a common construct in older forms of English. SLR11 contains many texts that are in the public domain because they are old enough to have passed into it, and thus they reflect this old construct.

It seems like we could get a pretty good boost by simply using newer texts in place of SLR11. However, we then have the legal question of how to obtain modern texts that are still open.

carlfm01 · January 15, 2019, 6:45pm

What if we correct the existing text? I think it’s not too hard, since these forms are easy to spot. The question is: is there any legal issue with editing the existing text?

kdavis · January 15, 2019, 8:20pm

Editing shouldn’t be problematic.

carlfm01 · January 16, 2019, 7:01am

Well, I said they would be easy to spot, but they are not easy to correct, hahaha. It’s worse than I thought.

I can take it on, but I will need the help of native speakers. For example, I changed “worse’n” to “worse than”.

Changing to “better than” here makes no sense:

A GIRL LIKE THAT OUGHT TO DO SOMETHIN BETTER’N THAN STAY HERE IN SOUTH HARNISS AND KEEP STORE

carlfm01 · January 16, 2019, 9:11am

I found Spanish and French sentences. Should I remove them, or is there any reason to mix languages? Sometimes a single sentence is mixed, for example “he said hola amigos”.

kdavis · January 16, 2019, 9:21am

I have an idea, which I’ve not had time to test yet.

I’ve made a number of language models with different parameters. In particular, in some of them I’ve limited the vocabulary to the N most frequent words of SLR11, for various values of N.

As the various “million’d”, “emerg’d”… are not common, they’ll likely not be among the N most frequent words, and the language model will not exhibit this strange behavior.

But I need to test this idea with some WER runs.
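The idea of keeping only the N most frequent words can be sketched roughly as below. This is an illustration under assumptions, not the actual procedure used to build those language models: the name `top_n_vocabulary` is made up, words are split on whitespace and lowercased, and how the resulting word list is handed to the LM toolkit is left out.

```python
from collections import Counter
from typing import Iterable, Set


def top_n_vocabulary(lines: Iterable[str], n: int) -> Set[str]:
    """Return the n most frequent whitespace-separated words in `lines`."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return {word for word, _ in counts.most_common(n)}


corpus = ["the cat sat", "the cat ran", "the dog emerg'd"]
vocab = top_n_vocabulary(corpus, 2)
print("emerg'd" in vocab)  # False
```

Rare forms such as “emerg’d” fall outside the cutoff as long as N is small enough relative to their frequency, which is the effect described above.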

carlfm01 · January 16, 2019, 6:36pm

If you share them, I can run the WER test.

carlfm01 · January 17, 2019, 8:58am

Here’s the cleaned librispeech-lm-norm.txt.gz; I’ve removed a lot of Spanish and French. One more to test.

What about using Google’s BERT to brute-force sentences? We could even try spaCy to generate sentences that make sense from the existing ones.

kdavis · January 17, 2019, 6:16pm

Unfortunately it’s 16 TB of language models; sharing is a bit hard.

carlfm01 · February 1, 2019, 1:16am

What about packing rnnoise into the current C++ client and adding an option to enable denoising on the fly? Do you think it would be worth trying to make them work together? For my use case, using ffmpeg with a band filter is not practical, at least when using the streaming feature from C#. I think it would be great to have the same noise filter for all the clients.

Here’s the GitHub repository: https://github.com/xiph/rnnoise

kdavis · February 1, 2019, 9:15am

It’s certainly possible.

However, there are a few reasons we have not added in rnnoise:

We’ve created, but have yet to use, a tool, voice-corpus-tool, to supplement our audio with noise and so make the model itself capable of denoising the audio. In this case rnnoise is not needed.
Adding in rnnoise with the current model will systematically modify the audio in ways not seen at train time and could increase WER.
Adding in rnnoise could require retraining the model with rnnoise in the pipeline to combat the previous issue.
Adding in another dependency where it may not be needed is something we try to avoid.
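As a rough illustration of the first point, supplementing clean audio with noise at a chosen signal-to-noise ratio can be sketched as follows. This is not how voice-corpus-tool works internally; the function name `mix_at_snr`, the list-of-floats sample representation, and the looping of short noise clips are all assumptions made for this example.

```python
import math
from typing import List, Sequence


def mix_at_snr(speech: Sequence[float], noise: Sequence[float],
               snr_db: float) -> List[float]:
    """Add `noise` to `speech`, scaled so the speech-to-noise power ratio is `snr_db`."""
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Loop the noise clip so it covers the whole speech signal.
    tiled = [noise[i % len(noise)] for i in range(len(speech))]
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(x * x for x in tiled) / len(tiled)
    if noise_power == 0:
        raise ValueError("noise clip is silent")
    # SNR in dB is 10 * log10(speech_power / noise_power); solve for the noise gain.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * x for s, x in zip(speech, tiled)]


mixed = mix_at_snr([1.0, -1.0, 1.0, -1.0], [0.5, 0.5], snr_db=20.0)
```

Training on audio augmented this way exposes the model to noisy input directly, whereas a denoiser placed in front of it changes the input distribution instead.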

This is our current take on rnnoise. I’d be curious as to your opinion/experience with it in the pipeline.

carlfm01 · February 1, 2019, 8:17pm

I think in the end only testing will tell us whether the model performs better handling the noise or handling the artifacts from rnnoise. I’ll add this to my roadmap. Thanks for sharing your opinion.