I’ve been playing with DeepSpeech for the past couple of days on the task of transcribing university lectures that were presented in English by German lecturers. I tried to use the models provided with the 0.6.1 release (pb, pbmm and tflite). So far the transcriptions have been far from accurate. What is the best way to tackle this issue?
Can I adjust some of the parameters to try to get better results?
Should I train a new model based on Germans speaking English?
Where can I find additional information about the following parameters:
--beam_width BEAM_WIDTH  Beam width for the CTC decoder
--lm_alpha LM_ALPHA      Language model weight (lm_alpha)
--lm_beta LM_BETA        Word insertion bonus (lm_beta)
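From what I understand (simplified, and all the numbers below are only illustrative, not the release defaults): beam_width is the number of candidate prefixes the CTC decoder keeps at each step (larger is more accurate but slower), while lm_alpha weights the language-model score and lm_beta adds a per-word bonus when ranking candidate transcripts, roughly like this:

```python
def score_candidate(acoustic_logp, lm_logp, word_count,
                    lm_alpha=0.75, lm_beta=1.85):
    """Simplified combined score used to rank beam-search candidates.

    acoustic_logp: log-probability from the acoustic model
    lm_logp:       log-probability of the word sequence under the LM
    word_count:    number of words in the candidate
    lm_alpha:      LM weight -- higher trusts the language model more
    lm_beta:       word-insertion bonus -- higher favors more words
    """
    return acoustic_logp + lm_alpha * lm_logp + lm_beta * word_count

# Two hypothetical candidates: the acoustic model slightly prefers the
# first, the language model clearly prefers the second.  Which one wins
# depends on lm_alpha:
low_alpha_a = score_candidate(-10.0, -4.0, 3, lm_alpha=0.25)   # acoustics win
low_alpha_b = score_candidate(-11.0, -2.0, 3, lm_alpha=0.25)
high_alpha_a = score_candidate(-10.0, -4.0, 3, lm_alpha=1.0)   # LM wins
high_alpha_b = score_candidate(-11.0, -2.0, 3, lm_alpha=1.0)
```

So raising lm_alpha makes the decoder lean harder on the language model, which could matter for accented speech where the acoustic model is less sure.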
If I should create a new model, where could I get the data? Is it possible to filter only Germans speaking English from the Common Voice initiative?
Thank you in advance!
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
2
There could be a number of reasons, ranging from how you capture the audio, to the noise level and quality of the sound, to the speakers themselves.
How far from accurate are we speaking? What is wrong exactly?
Have you had a look at python DeepSpeech.py --help ?
That might be one of the problems.
There’s accent information in the dataset, so you should be able to filter.
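For example, filtering the validated.tsv by the accent column could look like this (untested sketch; the column layout shown here is trimmed down and the "german" accent label is an assumption, check which values actually appear in your download first):

```python
import csv
import io

# Hypothetical excerpt of a Common Voice validated.tsv -- the real file
# has more columns; the accent values are assumptions for illustration.
sample_tsv = """client_id\tpath\tsentence\taccent
a1\tclips/0001.mp3\thello world\tus
b2\tclips/0002.mp3\tgood morning\tgerman
c3\tclips/0003.mp3\tthank you\tgerman
"""

def filter_by_accent(tsv_text, accent):
    """Return the rows whose accent column matches the given label."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row["accent"] == accent]

german_clips = filter_by_accent(sample_tsv, "german")
print([row["path"] for row in german_clips])
# → ['clips/0002.mp3', 'clips/0003.mp3']
```

On the real dataset you would open the file with `csv.DictReader(open(path))` instead of the embedded string.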
And this is the transcription:
deepspeech: error: the following arguments are required: --model, --audio
(base) 83adc305:DeepSpeech_install orlo$ deepspeech --model deepspeech-0.6.1-models/output_graph.pb --audio audio/prueba2.wav --lm deepspeech-0.6.1-models/lm.binary --trie deepspeech-0.6.1-models/trie
Loading model from file deepspeech-0.6.1-models/output_graph.pb
TensorFlow: v1.14.0-21-ge77504ac6b
DeepSpeech: v0.6.1-0-g3df20fe
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
Loaded model in 1.26s.
Loading language model from files deepspeech-0.6.1-models/lm.binary deepspeech-0.6.1-models/trie
Loaded language model in 0.000468s.
Running inference.
it was fittingly raising on a mat at the didnot domain you on and on paranoic satiating anointing for you you went on o old age was ever to get pretentious for the testator fosterville in hades himenes and for you are a young policy cannot on grathe true policy so in order to overcome this problem what we oughta do is that we consider hunting light to her the policy and what is the antipodes as the athenian the one indicates an this is what meditator the border imposing policies so this was policy valuation and not recount to polite one so policy is always want to caupolican elevation one policy veneration that we saw a bonaventure is no more an economic droning because on superstition a gas the picton riding at an hour in detaining steady repute become rusty in formation for the environment what a lontaine in one oneadatote do any betise to steps witality by ration as polite and women be portable poseidonia being by montalembert and after watch the real policy so the collocation uselin our policies readywitted patriotically but all the asometon batalha in more and polyanthus in him there he saw one thing that we have to think about the invention to before i know that all of timepiece but we are one that agoing porteous be eloise if we not explore the economies the estimates for days and then we may not offind a optimae paniscus he never find wood states for we have produced with this amalekite ray on and there are maitresse can consider so one is quite one pointing the other one is in oregon
Inference took 113.055s for 169.775s audio file.
lissyx
I used Izotope RX, do you have a suggestion of a better tool?
(thanks for the help)
lissyx
Listening to this:
volume seems low; we’ve had reports / experience that it might impair recognition
I hear a lot of background noise; this might well impair recognition too
the speaker really has a German accent; since our model is trained mostly on American English accents, it’s not impossible this adds to the previous issues
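A rough way to check the volume point is to measure the RMS level of the file; this is only a sketch using the Python standard library (the threshold is a rule of thumb, not an official figure):

```python
import math
import struct
import wave

def rms_dbfs(path):
    """RMS level of a 16-bit mono WAV in dBFS (0 dBFS = full scale).
    Very negative values (e.g. below -30 dBFS) suggest a quiet recording."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2 and w.getnchannels() == 1
        frames = w.readframes(w.getnframes())
    samples = struct.unpack("<%dh" % (len(frames) // 2), frames)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms / 32768.0)

# Build a quiet test tone: a 440 Hz sine at 5% of full scale, 16 kHz mono.
with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    data = b"".join(
        struct.pack("<h", int(0.05 * 32767 * math.sin(2 * math.pi * 440 * t / 16000)))
        for t in range(16000))
    w.writeframes(data)

print(round(rms_dbfs("tone.wav"), 1))  # roughly -29 dBFS for this tone
```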
lissyx
No idea what that is; we usually just rely on sox: PCM signed 16-bit little-endian, 16 kHz.
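A sox conversion to that format would be something like `sox input.wav -r 16000 -b 16 -c 1 output.wav`; to verify a file already matches it, a quick check with Python’s `wave` module (sketch, the function name is mine):

```python
import wave

def is_deepspeech_ready(path):
    """Check that a WAV matches the expected input format:
    16 kHz sample rate, mono, 16-bit signed PCM."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == 16000
                and w.getnchannels() == 1
                and w.getsampwidth() == 2)

# Write a minimal conforming file to demonstrate.
with wave.open("check.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 1600)  # 0.1 s of silence

print(is_deepspeech_ready("check.wav"))  # → True
```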
lissyx
@orbruno Also, given the quick listen, since you said it’s university lectures, it’s very possible the vocabulary is quite specific and our current generic language model does not handle it well.
The main idea is the possibility of integrating with the open-source Opencast video management software for lecture recording in universities.
What do you think about the path to follow?
Train a new model, then build a new language model? How much data (or how many hours) is needed for such training?
lissyx
No, but it’s easy to build one.
How widespread is this software? I know people working on eSup’s Pod have got interesting results, but on French speakers speaking French cc @ptitloup
Hard to tell like that.
Maybe get some text corpus of university lectures in the relevant fields, a few hundred MB of text, and add that to the language model to see the impact.
Maybe first tune the audio (silence / background noise, volume); there might already be a lot of improvement from that.
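For the language-model part, the lm.binary and trie shipped with 0.6.1 are built with KenLM and the generate_trie tool; before feeding a lecture corpus into that, the text has to be normalized to the model’s alphabet (lower-case letters, digits, apostrophes, spaces). A rough sketch of that normalization step (the sample lines are made up, and the exact character set is an assumption):

```python
import re

def normalize_line(line):
    """Lower-case and strip everything except letters, digits, apostrophes
    and spaces, then collapse whitespace -- so the LM text matches the
    characters the acoustic model can emit."""
    line = line.lower()
    line = re.sub(r"[^a-z0-9' ]+", " ", line)
    return re.sub(r"\s+", " ", line).strip()

# Hypothetical lines from a lecture-notes corpus:
corpus = [
    "The policy-evaluation step converges, in theory, to V(s).",
    "Dynamic programming assumes a known MDP!",
]
for line in corpus:
    print(normalize_line(line))
```

The cleaned lines (one sentence per line) would then go to KenLM’s lmplz / build_binary to produce a domain-adapted lm.binary.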
For starters, you could use MS/Google/AWS and see how they perform. If it’s really good, you could use that to train your own model. Don’t know if that’s legally fine, but probably ok for edu use. He really has a strong German accent; it will be hard to find good training material.
iZotope makes good tools - I’m sure the conversion is high quality.
@orbruno Note that the DeepSpeech models were trained on the June 2019 dataset release. A new one just came out which has around 400 extra hours. You could filter the new clips and train on those, or just extract the German accent ones.
Great, thanks for the info… I guess you are talking about the Common Voice data set? Is en_1488h_2019-12-10 the latest one?
lissyx
Yes, it is. I don’t know how much data there is for German people speaking English. Maybe it’s time to spread the word and get people to contribute to Common Voice in English with their German accent.