How-to guide for running my own voice-web

Hi,

I want to host my own voice-web to gather domain-specific voice data to further train an instance of DeepSpeech.

Is there a guide to getting it running?

Thanks

Ben

The same GitHub repo you linked should have all the information about how to set up your own instance.

I’m curious, what do you mean by “domain specific”?

Cheers.

Hi,

I’ve tried some test sentences against the pre-trained model version 0.5.1.

While some of the words are correctly resolved, some are not. For instance, ‘X11’ is resolved to ‘excellence’, and I consider ‘X11’ domain-specific. I think some of the other errors are due to the model being trained mostly on American English, while my project will be used in the UK.

If I test the same file with https://cloud.google.com/speech-to-text/ with the language set to English (Great Britain), it resolves correctly. Setting the language to English (United States) shows similar errors.
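The comparison I ran looks roughly like this (a minimal sketch, assuming the google-cloud-speech Python client; the file name and sample rate are placeholders, and exact class names vary between client versions):

```python
# Rough sketch of the comparison, assuming the google-cloud-speech
# Python client (pip install google-cloud-speech). The file name and
# sample rate are placeholders.
from google.cloud import speech

client = speech.SpeechClient()

with open("test.wav", "rb") as f:
    audio = speech.RecognitionAudio(content=f.read())

for language in ("en-GB", "en-US"):
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code=language,
    )
    response = client.recognize(config=config, audio=audio)
    for result in response.results:
        print(language, "->", result.alternatives[0].transcript)
```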

So I’m thinking that I need to further train the pre-trained model with English (Great Britain) audio that includes the domain-specific words, to get the accuracy up.
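If I go that route, my understanding is that the DeepSpeech training script takes CSV files listing the clips, so something like this sketch would prepare the data (the clip paths and transcripts are made-up examples):

```python
# Hypothetical sketch: build a train.csv in the format DeepSpeech's
# training script expects (wav_filename, wav_filesize, transcript).
# Note the transcripts may only use characters present in alphabet.txt.
import csv
import os

transcripts = {
    "clips/0001.wav": "open an x eleven session",
    "clips/0002.wav": "forward x eleven over ssh",
}

with open("train.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["wav_filename", "wav_filesize", "transcript"])
    for wav_path, text in sorted(transcripts.items()):
        writer.writerow([wav_path, os.path.getsize(wav_path), text])
```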

I don’t really know if “X11” is an issue of accent. The current English model is not perfect and still needs more training data to improve.

On Common Voice we collect accent data to improve this, and we can use the main site to encourage more people from the UK to donate their voices, so you can then train a model that is more accurate for accents we are currently missing.

@kdavis can probably provide more details on the current DeepSpeech English models and our plans to improve them.

@benshort The current DeepSpeech model is built largely from American voice data, so it is biased towards American (male) accents. Also worth noting that the 0.5.1 model does not include Common Voice data due to an oversight, so it may be further skewed in that regard.

It’s probably better to encourage British people to contribute to Common Voice, though, rather than setting up your own version. That way everyone can benefit from the data.

The alphabet DeepSpeech can output is defined in the alphabet.txt file, which contains no digits, i.e. 1,2,3…9,0 are never output by DeepSpeech. So ‘X11’ will never be output.
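You can check this yourself: alphabet.txt is just one character per line, with comment lines starting with ‘#’. A quick sketch, assuming the file shipped with the 0.5.1 release:

```python
# Read alphabet.txt (one character per line, '#' lines are comments)
# and check whether any digit could ever appear in the model's output.
with open("alphabet.txt") as f:
    chars = [line.rstrip("\n") for line in f if not line.startswith("#")]

print(sorted(chars))
print("contains digits:", any(c.isdigit() for c in chars))  # False for 0.5.1
```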

If you want to output digits you have at least these two options:

  1. Create a post-processor that converts “x eleven” to “X11”
  2. Train a new model with digits in the alphabet

My guess is that Google does what’s described in the first option.
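A minimal sketch of the first option, just to make it concrete; the phrase table here is hypothetical, and a real inverse text normalization step would be far more involved:

```python
import re

# Hypothetical table mapping spoken forms of domain terms back to
# their written forms; extend it with whatever your domain needs.
DOMAIN_TERMS = {
    r"\bx eleven\b": "X11",
    r"\bversion five\b": "v5",
}

def postprocess(transcript: str) -> str:
    # Apply each spoken-form -> written-form rewrite in turn.
    for pattern, written in DOMAIN_TERMS.items():
        transcript = re.sub(pattern, written, transcript)
    return transcript

print(postprocess("start an x eleven session"))  # start an X11 session
```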

@kdavis Yeah, I think we might want to add an Inverse Text Normalization stack onto DeepSpeech; this way it wouldn’t be a bottleneck in the training process.

Do you know if the 0.5.0 model was trained with the Common Voice data?

A post-processor would look at the resolved text and replace spelt-out numbers with digits before passing the text on to whatever logic needs to deal with it?

0.5.0 and 0.5.1 are actually the same model; 0.5.1 was just a bug-fix release. However, 0.4.x was trained on Common Voice, but that was a while ago, with less data.

Yep, just do a find and replace on the transcript.
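For example, something as simple as this toy sketch (only a few number words are covered; a real mapping would be much larger):

```python
# Toy find-and-replace: map spelled-out number words back to digits.
NUMBER_WORDS = {"one": "1", "two": "2", "ten": "10", "eleven": "11"}

def replace_numbers(transcript: str) -> str:
    return " ".join(NUMBER_WORDS.get(word, word) for word in transcript.split())

print(replace_numbers("x eleven crashed two times"))  # x 11 crashed 2 times
```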