English Deep Speech with German accent

I’ve been playing with Deep Speech for the past couple of days, trying to transcribe university lectures that were presented in English by German lecturers. I tried the models provided with the 0.6.1 release (pb, pbmm and tflite). So far the transcriptions have been far from accurate. What is the best way to tackle this issue?

Can I adjust some of the parameters to try to get better results?

Should I train a new model based on Germans speaking English?

Where can I find additional information about the following parameters:

--beam_width BEAM_WIDTH Beam width for the CTC decoder
--lm_alpha LM_ALPHA Language model weight (lm_alpha)
--lm_beta LM_BETA Word insertion bonus (lm_beta)
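
One way to see how much these three flags matter on a given recording is a small sweep over them. The sketch below only builds the command lines for the `deepspeech` client. `build_commands` is a hypothetical helper, not part of DeepSpeech, and the grid values are arbitrary illustrations rather than recommended settings.

```python
import itertools


def build_commands(model, audio, lm, trie, beam_widths, lm_alphas, lm_betas):
    """Return one `deepspeech` command (as an argument list) per parameter combination."""
    commands = []
    for beam_width, alpha, beta in itertools.product(beam_widths, lm_alphas, lm_betas):
        commands.append([
            "deepspeech",
            "--model", model,
            "--audio", audio,
            "--lm", lm,
            "--trie", trie,
            "--beam_width", str(beam_width),
            "--lm_alpha", str(alpha),
            "--lm_beta", str(beta),
        ])
    return commands


if __name__ == "__main__":
    # Grid values below are assumptions chosen only to illustrate the sweep.
    for cmd in build_commands(
        model="deepspeech-0.6.1-models/output_graph.pb",
        audio="audio/prueba2.wav",
        lm="deepspeech-0.6.1-models/lm.binary",
        trie="deepspeech-0.6.1-models/trie",
        beam_widths=[500, 1024],
        lm_alphas=[0.5, 0.75],
        lm_betas=[1.0, 1.85],
    ):
        print(" ".join(cmd))
```

Each printed line can be run by hand, or passed to `subprocess.run(cmd, capture_output=True, text=True)` to collect the transcripts for comparison.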

If I should create a new model, where could I get the data? Is it possible to filter only Germans speaking English from the Common Voice initiative?

Thank you in advance!

There could be a number of reasons, ranging from how you captured the audio and how noisy or low-quality the sound is, to the speakers themselves.

How far from accurate are we talking? What exactly is wrong?
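
One way to put a number on "far from accurate" is the word error rate (WER): the word-level edit distance between a hand-made reference transcript and the model output, divided by the number of reference words. A minimal sketch follows. `word_error_rate` is a hypothetical helper, and the reference sentence in the example is an assumed one for illustration, not a verified transcript of the lecture.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Assumed reference (illustration only) against a fragment of model output.
    print(word_error_rate("so this was policy evaluation",
                          "so this was policy valuation"))
```

Transcribing even one or two minutes by hand gives a baseline, so later changes to audio, parameters or language model can be compared on the same number.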

Have you had a look at python DeepSpeech.py --help?

That might be one of the problems.

There’s accent information in Common Voice, so you should be able to filter on it.
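
A rough sketch of such a filter, assuming the Common Voice metadata is a tab-separated file with a header row and a column named `accent`. The column name and the accent label used in the usage note are assumptions; check the header and the label values in your own download. `filter_by_accent` is a hypothetical helper.

```python
def filter_by_accent(tsv_path, out_path, accents, column="accent"):
    """Copy rows whose accent column matches one of `accents`; return the row count."""
    wanted = {accent.lower() for accent in accents}
    kept = 0
    with open(tsv_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        header = src.readline().rstrip("\n")
        # Raises ValueError if the assumed column name is not in the header.
        index = header.split("\t").index(column)
        dst.write(header + "\n")
        for line in src:
            fields = line.rstrip("\n").split("\t")
            if len(fields) > index and fields[index].lower() in wanted:
                dst.write(line if line.endswith("\n") else line + "\n")
                kept += 1
    return kept
```

For example, `filter_by_accent("validated.tsv", "validated_filtered.tsv", ["german"])` would keep only rows labelled `german`, if such a label exists in the file.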

Far enough, most of the words are not correct.

This is the audio (I changed the sample rate to 16 kHz, 16-bit mono): https://github.com/orbruno/audioexample/blob/master/prueba2.wav

And this is the transcription:
deepspeech: error: the following arguments are required: --model, --audio

(base) 83adc305:DeepSpeech_install orlo$ deepspeech --model deepspeech-0.6.1-models/output_graph.pb --audio audio/prueba2.wav --lm deepspeech-0.6.1-models/lm.binary --trie deepspeech-0.6.1-models/trie 

Loading model from file deepspeech-0.6.1-models/output_graph.pb

TensorFlow: v1.14.0-21-ge77504ac6b

DeepSpeech: v0.6.1-0-g3df20fe

Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.

Loaded model in 1.26s.

Loading language model from files deepspeech-0.6.1-models/lm.binary deepspeech-0.6.1-models/trie

Loaded language model in 0.000468s.

Running inference.

it was fittingly raising on a mat at the didnot domain you on and on paranoic satiating anointing for you you went on o old age was ever to get pretentious for the testator fosterville in hades himenes and for you are a young policy cannot on grathe true policy so in order to overcome this problem what we oughta do is that we consider hunting light to her the policy and what is the antipodes as the athenian the one indicates an this is what meditator the border imposing policies so this was policy valuation and not recount to polite one so policy is always want to caupolican elevation one policy veneration that we saw a bonaventure is no more an economic droning because on superstition a gas the picton riding at an hour in detaining steady repute become rusty in formation for the environment what a lontaine in one oneadatote do any betise to steps witality by ration as polite and women be portable poseidonia being by montalembert and after watch the real policy so the collocation uselin our policies readywitted patriotically but all the asometon batalha in more and polyanthus in him there he saw one thing that we have to think about the invention to before i know that all of timepiece but we are one that agoing porteous be eloise if we not explore the economies the estimates for days and then we may not offind a optimae paniscus he never find wood states for we have produced with this amalekite ray on and there are maitresse can consider so one is quite one pointing the other one is in oregon

Inference took 113.055s for 169.775s audio file.

How you do that might introduce artifacts.

I used deepspeech.py -h, is that the one you were referring to?

I used iZotope RX. Do you have a suggestion for a better tool?

(thanks for the help)

Listening to this:

  • The volume seems low; we have reports and our own experience that this can impair recognition.
  • I hear a lot of background noise, which will certainly hurt recognition.
  • The speaker really does have a German accent. Our model was trained mostly on American English accents, so it is quite possible this adds to the problems above.

No idea what that is; we usually just rely on sox, producing 16-bit signed little-endian PCM at 16 kHz.
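
Whatever tool does the conversion, the result can be sanity-checked with Python's standard `wave` module. This is a minimal sketch; `check_wav` is a hypothetical helper, and the mono requirement is taken from the conversion described earlier in the thread.

```python
import wave


def check_wav(path):
    """Return a list of format problems; an empty list means 16 kHz, mono, 16-bit PCM."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 16000:
            problems.append("sample rate is %d Hz, expected 16000" % wav.getframerate())
        if wav.getnchannels() != 1:
            problems.append("%d channels, expected 1 (mono)" % wav.getnchannels())
        if wav.getsampwidth() != 2:
            problems.append("%d-bit samples, expected 16" % (8 * wav.getsampwidth()))
    return problems
```

Calling `check_wav("audio/prueba2.wav")` and printing the result shows at a glance whether the file matches the expected format.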

@orbruno Also, from a quick listen, and since you said these are university lectures, it’s very possible the vocabulary is quite specific and our current generic language model does not handle it well.

Are there other language models available?

The main idea is a possible integration with Opencast, the open-source video management software for lecture recording in universities.

What do you think about the path to follow?
Train a new model, then find new language models? How much data (how many hours) is needed for such training?

No, but it’s easy to build one.

How widespread is this software? I know people working on eSup’s Pod have got interesting results, but with French speakers speaking French. cc @ptitloup

Hard to tell like that.

Maybe get a text corpus of university lectures in the relevant fields, a few hundred MB of text, and add that to the language model to see the impact.
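
Before such text goes into a language model it usually has to be reduced to the characters the acoustic model can output. The sketch below assumes the English alphabet is lowercase a to z, the apostrophe and the space; check the `alphabet.txt` you are using. `normalize_line` is a hypothetical helper, and building the actual language model from the cleaned text is a separate step not shown here.

```python
import re

# Anything outside the assumed alphabet (a-z, apostrophe, space) becomes a space.
_NOT_ALLOWED = re.compile(r"[^a-z' ]+")


def normalize_line(line):
    """Lowercase a line and keep only letters, apostrophes and single spaces."""
    text = line.lower().replace("\u2019", "'")  # typographic apostrophe to ASCII
    text = _NOT_ALLOWED.sub(" ", text)
    return " ".join(text.split())


if __name__ == "__main__":
    print(normalize_line("Policy Iteration, step 2: it\u2019s done!"))
```

Note that digits are simply dropped here; spelling numbers out as words would be a better choice for lecture material, but that is beyond this sketch.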

Maybe start by tuning the audio (blanks / background noise, volume); that alone might already bring a lot of improvement.
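
For the volume part, a simple peak normalization can be sketched with the standard library alone. This assumes a 16-bit PCM WAV file; `peak_normalize` and the `target_peak` value are hypothetical choices, and the sketch does nothing about background noise.

```python
import array
import sys
import wave


def peak_normalize(in_path, out_path, target_peak=0.9):
    """Scale a 16-bit PCM WAV so its loudest sample reaches target_peak of full scale."""
    with wave.open(in_path, "rb") as src:
        params = src.getparams()
        if params.sampwidth != 2:
            raise ValueError("expected 16-bit samples")
        samples = array.array("h", src.readframes(params.nframes))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian

    peak = max((abs(s) for s in samples), default=0)
    gain = 1.0 if peak == 0 else target_peak * 32767 / peak

    # Apply the gain and clamp to the valid 16-bit range.
    scaled = array.array(
        "h",
        (max(-32768, min(32767, int(round(s * gain)))) for s in samples),
    )
    if sys.byteorder == "big":
        scaled.byteswap()

    with wave.open(out_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(scaled.tobytes())
    return gain
```

The returned gain tells you how quiet the original was; a value well above 1 suggests the recording level was indeed low.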

For starters, you could use MS/Google/AWS and see how they perform. If the results are really good, you could use their transcripts to train your own model. I don’t know if that’s legally fine, but it’s probably OK for educational use. He really has a strong German accent, so it will be hard to find good training material.

iZotope makes good tools; I’m sure the conversion is high quality.

@orbruno Note that the DeepSpeech models were trained on the June 2019 dataset release. A new one just came out which has around 400 extra hours. You could filter the new clips and train on those, or just extract the German accent ones.

Great, thanks for the info. I guess you are talking about the Common Voice dataset? Is en_1488h_2019-12-10 the latest one?

Yes, it is. I don’t know how much data there is for German people speaking English. Maybe it’s time to spread the word and get people to contribute to Common Voice in English with their German accent :)