I am looking to use DeepSpeech to create a bot that can respond to a narrow set of speech inputs, in a command/response style. I’d like to check whether people think this is an appropriate use case, and whether I understand the likely path forward. I’ll eventually be implementing the bot as a C# process on Windows.
I have a Windows PC (with a GPU), a Mac laptop (weak GPU), and a Linux virtual server (no GPU, barely any CPU). Some of the documentation I’ve read says I need Linux or macOS for training; is this also true for creating a scorer? The Windows PC is the best machine I have access to, but if building a scorer is fairly low-CPU I could just do it on the Mac or even the Linux server.
Then I create a text file with the commands that the bot should be aware of, using as many variations of things people might say as I can think of. This produces some sort of output scorer.
Then I use this new scorer with the .NET bindings, instead of the default scorer, and it should improve the accuracy of recognizing valid bot commands, since they form a much smaller set of possible sentences than the default scorer covers.
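To make that concrete, here is roughly the flow I have in mind, sketched with the DeepSpeech Python bindings since that is what the docs show (I would do the equivalent through the .NET bindings; the scorer and audio file names below are just placeholders):

```python
import wave
import numpy as np
from deepspeech import Model

# Load the released acoustic model, then swap in a scorer built from
# the bot's command phrases ("bot-commands.scorer" is a placeholder name).
model = Model("deepspeech-0.9.3-models.pbmm")
model.enableExternalScorer("bot-commands.scorer")

# The model expects 16 kHz, 16-bit mono PCM.
with wave.open("command.wav", "rb") as wav:
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

print(model.stt(audio))
```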
Do I have this about right? Does it seem like a valid use-case for DeepSpeech?
Yes, in a way: the language model inside the scorer is built by KenLM, which is easiest to install on a *nix system. Why don’t you use Colab or your virtual server to create the scorer? Building the scorer just needs CPU and is fast for smaller input files.
Yes, your approach sounds good. Get going and let us know if you have any problems. Search for “custom scorer” or “custom language model” here to see what others did to get their own scorer.
Thanks for the input, I’m glad this seems plausible.
I am using Ubuntu Linux to try to create a custom scorer. So far, following the instructions here, I am able to transcribe the sample mp3.
Now I’m looking at this page on external scorer scripts. I downloaded, compiled and installed KenLM, but I couldn’t find generate_lm.py – I needed to download the 0.9.3 release itself (this is maybe obvious, but I had been copy-pasting commands so far and didn’t realize I needed another component). I also did “pip install progressbar”.
Edit: I needed “pip install progressbar2”, quickly found that with some Googling. The scorer generator is running, let’s see how it goes.
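For anyone following along, this is roughly what I’m running, wrapped in a small Python driver so I can tweak paths. The flags are the ones from the 0.9.3 external-scorer docs as I understand them, so double-check them against your release; all the paths are placeholders for mine:

```python
import subprocess

# Step 1: build the language model from my phrases file with generate_lm.py.
subprocess.run([
    "python3", "generate_lm.py",
    "--input_txt", "bot-phrases.txt",
    "--output_dir", ".",
    "--top_k", "500000",
    "--kenlm_bins", "/home/me/kenlm/build/bin/",
    "--arpa_order", "5",
    "--max_arpa_memory", "85%",
    "--arpa_prune", "0|0|1",
    "--binary_a_bits", "255",
    "--binary_q_bits", "8",
    "--binary_type", "trie",
    "--discount_fallback",
], check=True)

# Step 2: package lm.binary plus the vocab into a .scorer the bindings can load.
# alpha/beta here are just the example values from the docs.
subprocess.run([
    "./generate_scorer_package",
    "--alphabet", "alphabet.txt",
    "--lm", "lm.binary",
    "--vocab", "vocab-500000.txt",
    "--package", "kenlm.scorer",
    "--default_alpha", "0.931289039105002",
    "--default_beta", "1.1834137581510284",
], check=True)
```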
Well, I got it to work! Woohoo! I couldn’t run the optimizer, but the kenlm.scorer that I get even without optimizing (just using the example values for alpha and beta) works great. I’ve only tested it on my own voice but the thing is recognizing a limited set of phrases with very very high accuracy. If I say things outside the limited phrases I get very small partial matches, but that’s ok, that’s what I expected.
At the moment I have written some code to take my base phrases/commands and create a whole bunch of different combinations of them, including some numbers. This produces several million lines in my language-model input file, but with very little actual variation per line. Is there a more efficient way to turn this kind of ‘grammar’ into a scorer?
I have, like, a 500MB text file with only 100 unique words in it. It seems quite silly to take my command grammar, explode it into 500MB of text, and then watch KenLM crunch it back down into a scorer.
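For reference, the expansion script is essentially this kind of thing (the command templates and slot values below are made-up stand-ins for my real grammar):

```python
import itertools

# Hypothetical stand-ins for my real command grammar.
actions = ["set course to", "report status of", "open channel to"]
targets = ["engineering", "the bridge", "waypoint {n}"]
numbers = [str(n) for n in range(1, 100)]

with open("bot-phrases.txt", "w") as out:
    for action, target in itertools.product(actions, targets):
        if "{n}" in target:
            for n in numbers:
                out.write(f"{action} {target.format(n=n)}\n")
        else:
            out.write(f"{action} {target}\n")
```

With the real grammar and number ranges, that is how roughly 100 unique words balloon into hundreds of megabytes of near-identical lines.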
Great that it’s working. KenLM is not meant for less than gigabytes of data, so you are essentially hacking it here. You might also just try putting each of your words on its own line. But then you would have to leave out all the parameters that minimize the model, and you’ll get a warning that you can disable. Just search a bit here; it comes up from time to time.
Don’t worry about the LM values too much; they won’t change things a lot. Try some basic combinations and see what works best. Again, they are meant for a huge scorer.
lissyx:
Don’t. Just follow the docs and properly perform the setup steps; the dependencies will then be installed correctly.
I appreciate that as a project contributor it must be frustrating when someone has skipped a step in the setup. But I’m following instructions given in the 0.9.3 web docs. The page says “all you have to do to install DeepSpeech is …”
Then, since I am trying to create an external scorer, I’m reading the docs on the scorer page. I guess there is a set of steps in between the “hello, world” transcription from the front page and actually doing development activities like building a scorer. Are they the steps on this page?
I’m trying to follow the instructions, and I spent several hours on it yesterday.
lissyx:
This is for inference. Building a scorer is assumed to be part of training, hence we expect people to follow the training guide, since the tools to produce the scorer are part of the repo.
Feel free to send a PR to improve the docs. As I said, “assume” means feedback is welcome when the assumption is wrong. It also means that, since we assumed wrong, we need help to make it right.
I have a .NET client successfully doing STT with the default 0.9.3 model and scorer, so it definitely seems like I can get this to work! The code listens to transmissions on a VOIP server, decodes them to PCM and throws them at DeepSpeech. I get about 187%-of-realtime inference speed on a 5 GHz 8086K, using CPU inference. Not bad at all, and certainly enough to keep up with the VOIP transmissions.
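For the curious, the shape of that loop is roughly the following, shown with the Python bindings since the streaming API maps almost one-to-one onto the .NET one (the PCM chunk source is a stand-in for my VOIP decoder):

```python
import numpy as np
from deepspeech import Model

model = Model("deepspeech-0.9.3-models.pbmm")
model.enableExternalScorer("bot-commands.scorer")  # my custom scorer (placeholder name)

def transcribe_transmission(pcm_chunks):
    """Feed 16 kHz, 16-bit mono PCM chunks from one VOIP transmission."""
    stream = model.createStream()
    for chunk in pcm_chunks:  # each chunk: raw PCM bytes from the VOIP decoder
        stream.feedAudioContent(np.frombuffer(chunk, dtype=np.int16))
    return stream.finishStream()
```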
Next is more fiddling with the constrained grammar to improve recognition, and some figuring out of user intent. Any tips on that? I’m going to start with some lo-fi regex stuff but maybe Rasa or another library would be good for that? (Needs a .NET binding if possible).
Then I create a text file with the commands that the bot should be aware of, using as many variations of things people might say as I can think of. This produces some sort of output scorer.
That makes sense to me. I have never tried to make an external scorer, but it seems like the right way to proceed. Would you like to share more step-by-step details of how you solved it?
Next is more fiddling with the constrained grammar to improve recognition, and some figuring out of user intent. Any tips on that? I’m going to start with some lo-fi regex stuff but maybe Rasa or another library would be good for that?
Figuring out user intent is maybe out of scope here.
Well, the solution is related to the problem, as usual.
It all depends on the context of your “narrow-domain” (voice) bot.
So if you have to manage closed-domain, task-oriented conversations, regexps are a fast and common solution. You may also want to manage conversation state; maybe my open-source dialog manager could give you some ideas.
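As a trivial sketch of the regexp route (the intent names and patterns below are purely illustrative):

```python
import re

# Illustrative patterns only; real ones would mirror the bot's command grammar.
INTENT_PATTERNS = {
    "set_course": re.compile(r"\bset course to (?P<target>.+)", re.I),
    "report_status": re.compile(r"\breport status of (?P<target>.+)", re.I),
}

def match_intent(transcript):
    for intent, pattern in INTENT_PATTERNS.items():
        match = pattern.search(transcript)
        if match:
            return intent, match.groupdict()
    return None, {}

print(match_intent("please set course to the bridge"))
# -> ('set_course', {'target': 'the bridge'})
```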
If your “intents” are large in number, you probably need an intent classifier. RASA is a perfect candidate, and if you need speed, I’d suggest:
I’m now playing with fine-tuning: taking the 0.9.3 checkpoint and training it further on some samples that I have. I’ve just realized (based on some other threads) that using the same data for training, validation, and testing will lead to over-fitting (well, I guess that was obvious), but I’m unsure how ‘bad’ this is. I currently have 80 short audio clips to use to fine-tune the model. Obviously more is better, but if I have 100 clips, how many should I use for training vs. validation vs. testing? What ‘loss’ number counts as a good result?
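For context, the fine-tuning run itself looks roughly like this, with the flags as I understand them from the 0.9.3 training docs (the CSV paths and hyperparameter values are just what I’m currently experimenting with, not recommendations):

```python
import subprocess

# Resume from the released 0.9.3 checkpoint and fine-tune on my own clips.
subprocess.run([
    "python3", "DeepSpeech.py",
    "--n_hidden", "2048",
    "--checkpoint_dir", "deepspeech-0.9.3-checkpoint",
    "--epochs", "3",
    "--learning_rate", "0.0001",
    "--train_files", "voice-training-data/train.csv",
    "--dev_files", "voice-training-data/dev.csv",
    "--test_files", "voice-training-data/test.csv",
    "--scorer_path", "kenlm.scorer",
], check=True)
```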
Really bad, please read more about deep learning in general or it will be hard for you to get good results. Basically, never use data twice.
Haha, yes well I’m clearly hacking my way around. Thanks for tolerating me
The lack of samples is due to transcription effort. I’m hopeful I can get to hundreds on my own and maybe a thousand if I get promising results and engage some community members in my domain.
If I have a set of N transcribed samples (where N > 200, say), should I always use the same particular subset for training/validation/testing, or should I randomly divide them into buckets of 80%, 15%, and 5%?
Second question: I’m trying to use my own machine and its 3080 GPU, but I’m running into issues. The 3080 is only supported by recent drivers and needs CUDA 11, while DeepSpeech uses an older version of TensorFlow, 1.15, that wants a two-year-old version of CUDA/cuDNN. So far I’ve not managed to get the GPU recognized by the older TensorFlow. Can I use a more recent TensorFlow, or would that require changes to DS?
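For what it’s worth, this is the quick check I’ve been using to see whether the old TensorFlow build can see the card at all:

```python
# Quick check of whether the TF 1.15 build used by DeepSpeech sees the GPU.
import tensorflow as tf

print(tf.__version__)              # expect 1.15.x for DeepSpeech 0.9.3
print(tf.test.is_gpu_available())  # currently False on my machine
print(tf.test.gpu_device_name() or "no GPU device found")
```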
The other suggestion here was “use Google Colab”, and although that looks promising, I’m hitting a lot of apparent incompatibilities between Python versions, the libraries DS wants to use, etc. Is anyone aware of a Colab notebook using DS 0.9.3 that I could take a look at?
Random sounds good, but generally the train data can be bad, the validation data should be good, and the test data should be great. And each sample stays in its bucket for the whole training.
Search around here for 3090; there were some people who did get it to work. Training is still on TF 1.15, unfortunately. There have been some ideas recently about switching to TF 2, but nothing concrete.
No, but just follow the docs and you’ll have a notebook in 10 minutes.
Another Colab user mentioned that the virtualenv needs to be activated in each code block (because each block is effectively a new shell). I’ve done that, but now I get an error about a non-writeable directory.
I’ve managed to reproduce where I got to via CPU training, but on Colab using a GPU. This is truly awesome stuff. Now I can sort out my training data properly and train in (hopefully) a reasonable amount of time using Colab.
So, progress. I’ve managed to make a bot that is actually interactive and can (sometimes) understand sentences from users. This is great.
I’m now working on training data. My approach is to use the 0.9.3 checkpoint and then to “fine tune” using more speech samples. I have a process where users can submit their own samples to me, for use in training. They are asked to read some sentences from the bot’s intended vocabulary, and I get a .wav file and a transcript. I also have a bunch of audio samples from the same domain, but that are not people addressing the bot. It’s people talking to each other, for example. I am starting to transcribe these by hand, but I would say they are generally less-good samples.
I have written a script to ‘sprinkle’ the high-quality bot-specific samples into the training, validation and test data sets (a sketch of roughly what it does is below these questions). I can guarantee that no specific .wav file is ever used more than once, but I have a couple of questions:
Should I ‘save’ my highest quality samples only for testing? Or is dividing them up for use in train/validate ok? What proportions should I use?
I did some reading on machine learning fundamentals, and the goal of the validate/test data is to ensure the model is actually generalizing rather than learning specific patterns in its input. When I give it new samples in the validate/test phase, is it enough that each is a new sample (a new wav file, something different being said), or should I also try to ensure it is a new voice? (i.e. a user the model has never heard before at all)
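For reference, the ‘sprinkle’ script mentioned above boils down to something like this (the CSV columns are the standard DeepSpeech wav_filename/wav_filesize/transcript layout; the ratios and paths are just my current guesses):

```python
import csv
import os
import random

def write_csv(path, rows):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        writer.writerows(rows)

def split_samples(rows, seed=42, ratios=(0.8, 0.15, 0.05)):
    """Shuffle once with a fixed seed so every wav lands in exactly one bucket."""
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * ratios[0])
    n_dev = int(len(rows) * ratios[1])
    return rows[:n_train], rows[n_train:n_train + n_dev], rows[n_train + n_dev:]

# Fill with (wav_path, transcript) pairs gathered from user submissions.
samples = []
rows = [(path, os.path.getsize(path), text) for path, text in samples]
train, dev, test = split_samples(rows)
write_csv("train.csv", train)
write_csv("dev.csv", dev)
write_csv("test.csv", test)
```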
Test data should be as close to real life as possible, and real-life recordings are usually not the best quality. You should rather use the high-quality material for dev.
We are talking about new sentences; some would argue new voices don’t hurt either. Again, if you have loads of material, split it. If you have only a few samples, fine-tune with those and put the most realistic ones in dev.
I’m not sure which training phase you mean by ‘dev’ – is that train, validate or test?
I have the following content:
(1) Carefully recorded phrases from users with a good microphone and little background noise – several hundred sentences from a few dozen users.
(2) Real-world recordings, with less clarity, manually transcribed.
What I think you’re telling me is that (1) is actually more useful for training than for testing, and that (2) is what I want in my test set. I have thousands of samples in (2), and I can divide them however is most beneficial.
These are the outputs from my most recent training run. I know “is this good?” is a relative question, but does this seem like it is producing a useful model yet?
Testing model on voice-training-data/test/cb-test-data-wav.csv
Test epoch | Steps: 1 | Elapsed Time: 0:01:12
Test on voice-training-data/test/cb-test-data-wav.csv - WER: 0.443700, CER: 0.194415, loss: 35.202534