Deep Speech 0.5.0 Pre-Release Model + Checkpoint

Just wanted to give you all a peek at the pre-release Deep Speech 0.5.0 English model and checkpoint ahead of the coming general release.

Our plan is to do the 0.5.0 release on June 11th, but we thought it might be nice if you could get your hands on the model and “kick the tires”, as it were, to check performance for your particular use case and see how things are working. If you have comments on the release, please leave them in this thread. If you find a bug, please file a GitHub issue.

The model and checkpoint should work with master or any of the more recent releases, e.g. v0.5.0-alpha.11.

The checkpoint is here: deepspeech-0.5.0-checkpoint.tar.gz, and the model is here: deepspeech-0.5.0-models.tar.gz.


Thanks for posting this :slight_smile:

I just set up a quick Conda environment and tried out the model with some test recordings I have, and it seemed pretty good.

Whilst the results weren’t 100%, my audio was fairly quickly spoken and, worse than that, recorded at 22.5 kHz rather than the 16 kHz the model expects, so I’d say it did pretty reasonably (plus perhaps my English accent may be throwing it off a little?). I’ll dig out some clearer cases, convert them to 16 kHz as sketched below, and test a bit further later on.
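For anyone wanting to do the same, here’s a rough sketch of the conversion and inference steps I have in mind, assuming the deepspeech==0.5.0 pip package plus soundfile and scipy are installed, a mono recording at the usual 22,050 Hz, and the model tarball unpacked into ./models (the file names under models/ come from the tarball; everything else is a placeholder):

```python
# Rough sketch: resample to 16 kHz, convert to 16-bit PCM, run inference.
import numpy as np
import soundfile as sf
from scipy.signal import resample_poly
from deepspeech import Model

audio, rate = sf.read("my_recording.wav")       # float64 in [-1, 1], 22050 Hz
audio_16k = resample_poly(audio, 320, 441)      # 16000/22050 = 320/441
pcm_16k = (audio_16k * 32767).astype(np.int16)  # the model expects 16-bit PCM

# 0.5.0-era constructor: model path, MFCC features (26), context window (9),
# alphabet path, beam width (500), matching the values in the project's client.py
ds = Model("models/output_graph.pbmm", 26, 9, "models/alphabet.txt", 500)
print(ds.stt(pcm_16k, 16000))
```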

Results

Expected: this will make certain that the largest possible portion of the funds alloted
Output: this will make certain the largest possible portion of the fantastic

Expected: after his bankruptcy he obtained a place as clerk in the Great Northern railway office
Output: after his bankruptcy he obtained a place as clerk in the great bar than railway office

I jotted down how I did it in this gist: https://gist.github.com/nmstoker/c30a833fc697a21f2f385a82e0d1172c

Looking forward to the full release soon :tada:

Thank you guys, I was waiting for this for a long time.

Played about with the 0.5 model today with the latest code on master. It’s a mixed bag on my test files.

Here’s what improved:

Transcript:

transparent pricing streamlined purchase a three day worry free exchange and test drives that come to you

0.4.1:

anarchising stream line purchase a three day worry for exchange test drives that come to you

0.5:

transparent rising stream line purchase a three day worry free exchange and test drives that come to you

Transcript:

before we begin, we’re very excited to announce that our new book is coming out it’s available for pre order right now it comes out october the eighteenth we would love it if you bought it

0.4.1:

before we begin were very excited to announce that our new book is coming out its available for pre order right now it comes out all torothee we would love it if you boat it

0.5:

before we begin were very excited to announce that our new book is coming out it’s available for pre order right now he comes out alterity eighteenth we would love it if you bought it

Transcript:

if you wanna know for example why they banned false beards in copenhagen that’s in the book if you wanna find out what method was used to get those thai football team kids out of that cave

0.4.1:

if ye want to know for example why they band pulsebeats in copenhagen that’s in the book if you want to find now what method was used to get those tifoon kids out of that cave

0.5:

if you want to know for example why they band false beards in copenhagen that’s in the book if you want to find out what method was used to get those tie football team kids out of that cave

What got worse:

Transcript:

and now i am joined live in the studio by the prime minister theresa may good morning prime minister morning andrew um can we agree to start with that the one thing that voters deserve in what you yourself has said is going to be a very very important election is no sound bites

0.4.1:

and now i am joined life in the studio by the prime minister teresa make good morning from lin enter and can we agree to start with it the one thing the voters deserving what you yourself he said is going to be a very very important election is no son to bite

0.5:

and now i am joined live in the studio by the prime minister to resume or morning from going into and can we agreed to start with the one thing that vote has deserved in what you yourself he said is going to be a very very important election is no son to bits

Transcript:

in five days from now, MPs will vote on the brexit deal the vote that will not only decide britain’s future in europe but the future of the prime minister

0.4.1:

in five days from now and peace will vote on the break it deal the boat the will not only decide britons future in europe of the future of the prime minister

0.5:

in five days from now this will boat on the break it deal the vote that will not only decide briton’s future in europe at the future of the prime minister

And this one just turned into gobbledegook:

Transcript:

your royal highness meghan markle congratulations to you both thank you can we start with the proposal and the actual moment of your engagement when did it happen how did it happen

0.4.1:

for all highness than i can well coagulations to both and you care i was sold with the proposal and the actual woman your engagement wanted at an how to the happen

0.5:

moral highness and me amalgamations to both and her oneself with the proposal and the acumen for ingagement wended out on honedale

So in summary, the model does seem better at the nuances between similar words (e.g. “bought” vs “boat”), but in some cases it is producing worse output than 0.4.1.

(Note: when comparing two incorrect transcripts, I consider one “better” if it is phonetically closer to the correct word - e.g. “and peace” is closer to “MPs” than “this”.)

I’ve done some additional testing and the pattern seems to be that files with a fairly low WER (word error rate) in 0.4.1 do better on 0.5, while files with higher error rates now do worse. Files with the highest WERs now become completely nonsensical.
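In case it helps anyone reproduce the comparison, this is roughly how I’m scoring the files; the jiwer package is my own choice and the strings are excerpts from the examples above, nothing official:

```python
# Rough per-file WER scoring: pip install jiwer
from jiwer import wer

# (reference, 0.4.1 output, 0.5 output), excerpted from the posts above
samples = [
    ("in five days from now mps will vote on the brexit deal",
     "in five days from now and peace will vote on the break it deal",
     "in five days from now this will boat on the break it deal"),
]

for ref, old, new in samples:
    print(f"0.4.1 WER: {wer(ref, old):.2f} | 0.5 WER: {wer(ref, new):.2f}")
```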

@kdavis Hi…
I used to fine-tune from the 0.4.1 checkpoint; thanks to you, I am now fine-tuning from the 0.5.0 checkpoints.
Training has started, but I am not getting any info on how long the epoch will take to complete, i.e. an ETA. Is there a flag I missed?
My output is:

I Restored variables from most recent checkpoint at deepspeech-0.5.0-checkpoint/model.v0.5.0, step 467356
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 1:01:58 | Steps: 158 | Loss: 24.095928

Also, in 0.4.1 the epoch counter started from a value other than 0 when fine-tuning, but this time it started at 0??

Those are both expected changes. Due to the new tf.data-based input pipeline, we don’t know the size of the dataset upfront. We could count it during epoch 0 and use it afterwards, but we didn’t feel it was very useful anyway, given our use of curriculum learning, which throws off the ETA values until you’re very close to the end of the epoch.

The epoch count is now always relative, since there was a lot of confusion with the old setup, where it would recalculate the epoch number from the combination of trained steps + dataset size + batch size + number of GPUs. Now things should always be predictable. If you want to train for N epochs, regardless of whether you’re fine-tuning, specify --epochs N. To keep track of how long each checkpoint has been trained for, you can look at the step count.
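For example, a three-epoch fine-tuning run would look something like this (the CSV paths are placeholders for your own data; the checkpoint directory is the unpacked release tarball):

```
python3 DeepSpeech.py \
  --train_files my_train.csv \
  --dev_files my_dev.csv \
  --test_files my_test.csv \
  --checkpoint_dir deepspeech-0.5.0-checkpoint \
  --epochs 3
```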

The v0.5.0 release is now out: https://github.com/mozilla/DeepSpeech/releases/v0.5.0


@lissyx @reuben @kdavis Do you have any thoughts on this? Is this expected, or is it a regression? The ones that turn into nonsense are particularly concerning to me because it’s not even detecting the correct number of words or word boundaries. One of my test files had an entire sentence transcribed as just “a”, whereas in 0.4.1 it at least appeared as a sentence with a couple of words correct.