DeepSpeech trained on the Maori Language

First off, huge thanks to the Mozilla team for this work. If there’s any information we can provide that would be of use to you, just let us know.

We have recently trained DeepSpeech on Te Reo Maori (the Maori language, spoken by the indigenous people of New Zealand).

We had a collection of 1,300 speakers and over 193,000 recordings totaling over 300 hours of recorded audio.

Our text corpus is small, 10-20MB at the moment (increasing it is the first thing we plan to do in the next leg of the project).

Using this, we achieved a 14.0% word error rate on a test set consisting of 13% of all of the recorded audio (~27,000 recordings).
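For readers unfamiliar with the metric, here is a minimal sketch of how a word error rate can be computed for a single utterance: the word-level edit distance between the reference and the transcription, divided by the number of reference words. The function name is made up for illustration, and this is not necessarily how DeepSpeech computes or aggregates the figure reported above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Note that a corpus-level figure can be obtained either by averaging per-utterance rates or by dividing total edits by total reference words, and the two generally give different numbers.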

We made a distinction between speakers and sentences in the held-out set, and found that DeepSpeech achieved a 13.8% word error rate on sentences not included in training (but spoken by speakers included in the training set).

The word error rate on sentences that were included in training, but spoken by speakers not in the training set, was 6.2%.
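As a rough illustration of the speaker/sentence distinction, the sketch below buckets held-out recordings by whether their speaker and their sentence also appear in the training set. The data layout (pairs of speaker ID and sentence text) and the function name are assumptions made for the example, not a description of our actual pipeline.

```python
from collections import defaultdict


def bucket_heldout(train, heldout):
    """Group held-out (speaker, sentence) pairs by overlap with training.

    Keys are (speaker_seen_in_training, sentence_seen_in_training).
    """
    train = list(train)
    train_speakers = {speaker for speaker, _ in train}
    train_sentences = {sentence for _, sentence in train}

    buckets = defaultdict(list)
    for speaker, sentence in heldout:
        key = (speaker in train_speakers, sentence in train_sentences)
        buckets[key].append((speaker, sentence))
    return buckets
```

With this layout, `buckets[(True, False)]` holds new sentences from known speakers and `buckets[(False, True)]` holds known sentences from new speakers, so an error rate can be computed for each group separately.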

Our model was trained on AWS using a p3.8xlarge instance in 18 hours, with the following hyperparameters:

lm_weight = 2.00
epoch = 1
learning_rate = 0.0001
max_to_keep = 3
display_step = 0
validation_step = 0
dropout_rate = 0.30
default_stddev = 0.046875
early_stop = 1
earlystop_nsteps = 10
log_level = 0
summary_secs = 120
fulltrace = 1
limit_train = 0
limit_dev = 0
limit_test = 0
valid_word_count_weight = 1
checkpoint_secs = 600
max_to_keep = 1
train_batch_size = 16
dev_batch_size = 16
test_batch_size = 16
checkpoint_secs = 600
summary_sec = 600
max_to_keep = 10

Everything was run inside Docker images, and we used nvidia-docker to deploy the model on GPU machines. The hyperparameters were logged by our continuous integration system, Gorbachev.

We also have a more detailed report on the project, and might be able to share it (after some minor editing) if anyone else is interested.


Awesome! I’d definitely be interested in reading more in your report.

Great summary, thanks for that.

One question here - was it trained with the 0.1.1 model or with the new streaming 0.2.0 model?

(Same guy as OP here - just sorted out GitHub sign-in)

We were running 0.2.0-alpha.7 with minor modifications.

At the moment, I’m not able to confirm whether we actually managed to get a streaming model as a result… I haven’t figured out how to make use of it yet, so in deployment it’s currently being combined with a segmenter we also wrote.

In any case, it’s working just fine for now. We are definitely interested in getting the streaming use-case sorted out though.

I was tracking closely behind the Mozilla team while I was training the model, and started on 0.2.0-alpha.9, but I found myself with a trained model and no way to run inference, because the inference code for that version hadn’t been written yet.

So I had to toss out those training runs and roll back to alpha.7, where I was able to run inference. Now that 0.2.0 is out of alpha, these probably won’t be problems for you anymore.

Glad to hear that you are able to run the DeepSpeech model and get good results!

I am also trying to run the model using a Docker container, but I am stuck on passing the input. My Docker container is up and running. Could you please point out how I can pass the input (.wav file)?

You’ll need to provide more information about your Docker image, ideally your Dockerfile.

  1. Do you have the models inside your Docker image as well, or have you mounted them?
  2. When you run the container, what command is started - is it the deepspeech client just awaiting the input wav, or is it just an environment with deepspeech installed in it?

Also this looks like a separate topic to the training of Maori language - how about starting a new topic, something like “How to run deepspeech inference in docker”.

Thanks Yv001, for the response.

Yes, I have created a separate thread. Below are the answers to your queries:

  1. Yes, I have the DeepSpeech model inside the Docker container, and I can successfully convert .wav input to text, but only when I am inside the container. I am stuck on providing the input and running the command from outside the container, i.e. from the host.

  2. When we run the Docker container, it’s just the environment that is ready. We still need to run the command again each time we have a new input.

Is there any way to just provide the input, instead of running the command again and again?

Please advise on how we can solve this issue. You can find more information in the separate thread that I mentioned above.

As I read your comment, there are two ways to understand what you might mean.

The first is that you’re asking how to run docker commands without having to type docker run [...] into the command line every time.

If that is what you mean: we use a Makefile to manage running all of our scripts inside of the Docker container.

Basically, you can assign the docker run string to a variable, and then define shorthand targets that run arbitrary commands in the container by prepending that string to each command.
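To make the idea concrete, here is the same pattern sketched in Python rather than make: build the docker run prefix once, then prepend it to whatever command should run in the container. The image name, the mounted directory, and both function names are placeholders for illustration; our actual Makefile is not shown here.

```python
import shlex
import subprocess
from pathlib import Path


def docker_run_prefix(image, host_dir, container_dir="/data"):
    """Build the reusable 'docker run' prefix, mounting host_dir into the container."""
    host = Path(host_dir).resolve()
    return ["docker", "run", "--rm", "-v", f"{host}:{container_dir}", image]


def run_in_container(prefix, command, dry_run=False):
    """Prepend the prefix to a command; return the full command line as a string."""
    full_command = prefix + list(command)
    if not dry_run:
        # Raises CalledProcessError if docker or the inner command fails.
        subprocess.run(full_command, check=True)
    return shlex.join(full_command)


if __name__ == "__main__":
    # Hypothetical image name; replace with your own image and command.
    prefix = docker_run_prefix("my-deepspeech-image", "./audio")
    print(run_in_container(prefix, ["ls", "/data"], dry_run=True))
```

Mounting the directory that holds the .wav files is what lets a command started from the host see its input inside the container. The command to run there depends on how deepspeech is installed in your image.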

The other possible meaning is that you’re asking how to transcribe multiple files without having to call the deepspeech bindings from the command line every time. The answer to that is to write your own version of ( to get transcriptions from your input files.

I’ve noticed that this way is much faster than calling deepspeech from the command line, since each command-line call has to initialise the tensorflow graph variables from disk again.
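The load-once idea can be sketched roughly as follows. To stay independent of any particular DeepSpeech version, the model is hidden behind a hypothetical `load_model` callable that you would write yourself using the bindings for your version: it is assumed to do the expensive loading and return a function that takes raw PCM bytes and a sample rate and returns text. Only the standard-library `wave` module is used for reading audio.

```python
import wave


def read_wav(path):
    """Return (raw PCM bytes, sample rate) for a WAV file."""
    with wave.open(str(path), "rb") as wav:
        return wav.readframes(wav.getnframes()), wav.getframerate()


def transcribe_files(load_model, paths):
    """Load the model once, then transcribe every file with it.

    load_model is a hypothetical callable supplied by you; it should return
    a function transcribe(pcm_bytes, sample_rate) -> str.
    """
    transcribe = load_model()  # the slow step, done a single time
    results = {}
    for path in paths:
        pcm, rate = read_wav(path)
        results[str(path)] = transcribe(pcm, rate)
    return results
```

The point of the design is simply that the cost of loading the graph is paid once per process rather than once per file. How the PCM bytes need to be converted before being handed to the model depends on the bindings you are using.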

Or maybe your problem is both of these problems, in which case I’d suggest writing your own version of, and then running it from a makefile.

Or possibly you’re thinking of something completely different than what I’ve addressed. I can give you more detail about either of my answers if you can let me know more about your problem.