Using pre-trained model

Hi, I’m using the Mozilla DeepSpeech model for educational purposes, but I’ve run into some trouble. The model runs pretty well on the examples I found for version 0.6.0 (I’m running 0.6.1). Here is the output for one of them, and I’d say it’s correct:

(deepspeech-venv) C:\Diploma\Tech\Deepspeech>deepspeech --model model/output_graph.pbmm --lm model/lm.binary --trie model/trie --audio dataset/audio/2.wav
Loading model from file model/output_graph.pbmm
TensorFlow: v1.14.0-21-ge77504ac6b
DeepSpeech: v0.6.1-0-g3df20fe
2020-04-15 21:37:52.706554: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
Loaded model in 0.026s.
Loading language model from files model/lm.binary model/trie
Loaded language model in 0.0191s.
Running inference.
why should one halt on the way
Inference took 2.508s for 2.735s audio file.

I used the lm.binary and trie files that came with the model’s tar.gz.
The trouble is with my own audio: even for a 2-second clip saying, for example, “experience”, it struggles and doesn’t recognize it. I tried audio saying “here I am” (speech starts at 1 s); it wasn’t recognized correctly, the output was “he eleanora”. Should I build lm.binary and the other files myself?
Any help would be much appreciated!

Please document how you record those, as well as any speech-related features (your accent, etc.)

how you record those

I recorded it on my laptop (Windows 10). The format is m4a at 48 kHz; then I converted it to WAV at 16 kHz with a converter.

your accent

I’m from Central Asia.

any speech-related feature

There is no noise to speak of; I recorded the example alone in my room.

Do you need an example uploaded?

So that’s one big source of risk. Can you ensure it fits the requirements of the network: WAV, PCM 16-bit, 16 kHz, mono?
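If it helps, here is a minimal check using Python’s standard wave module to verify a file against those requirements (the path is just a placeholder, and the wave module only opens uncompressed PCM WAV files):

```python
import wave

def check_wav(path):
    """Check that a WAV file is PCM, 16-bit, 16 kHz, mono."""
    with wave.open(path, "rb") as w:
        ok = (
            w.getnchannels() == 1          # mono
            and w.getsampwidth() == 2      # 16-bit samples
            and w.getframerate() == 16000  # 16 kHz
        )
        print(path,
              "channels:", w.getnchannels(),
              "sample width:", w.getsampwidth() * 8, "bit",
              "rate:", w.getframerate(), "Hz",
              "-> OK" if ok else "-> needs conversion")
        return ok

check_wav("dataset/audio/2.wav")
```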

English spoken with a Central Asian accent? Sadly, this might be a big source of confusion for the network. Accent diversity is a big problem; Common Voice is one part of the solution.

If you can, try to mimic a US English accent as much as possible; it might help.

Turns out that I didn’t convert to mono, and now recognition works better. I added a photo of how I now convert my audio.
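For anyone without the screenshot, an equivalent conversion can be scripted. This is just a minimal sketch, assuming pydub with ffmpeg on the PATH and placeholder file names:

```python
from pydub import AudioSegment

# Convert the original m4a recording to what DeepSpeech expects:
# WAV, PCM, 16 kHz, 16-bit, mono. File names are placeholders.
audio = AudioSegment.from_file("recording.m4a")
audio = (
    audio.set_frame_rate(16000)  # resample to 16 kHz
         .set_channels(1)        # downmix to mono
         .set_sample_width(2)    # 16-bit samples
)
audio.export("recording_16k_mono.wav", format="wav")
```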


Thank you a lot :slight_smile:
Since it now works well but still recognizes some words incorrectly, is there anything to be done about it?
My “experience” was recognized as “the sperience”, and a sentence saying “here I am and I want to recognize my speech via deep speech recognition technology” came out as “here i am and i want to present night my speech the deep speech recognition to foote”.

This is very likely just your accent at that point; I get similar results on the released English model with my French accent.

Yeah, I’d use it if my target audio were similar to my examples, but it is not. There is some noise, not much, but it still exists. Anyway, thank you!

Some noise might not seem that big but still be enough to mess with the transcription. This is something we are trying to address via Common Voice and noise augmentation.

Hi, I have a new question here. I wonder if there is a way to write the recognized text to a .csv file and, at the same time, break it into several rows, each row representing one minute of audio. In other words, I’d be glad to know if there is a way to get an output CSV file with two columns, “frame_started_time” and “text_it_contains”, while inferencing.
If it is not possible with the pretrained models, I’ll go another way. Thanks in advance!

I am not quite sure that I get what you want. You could get the metadata from the native client with the extended flag and write this info yourself:
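The CLI’s extended output comes from sttWithMetadata, so you can call it directly from Python and write the CSV yourself. This is only a sketch against the 0.6.x API, with placeholder paths and the usual default decoder parameters; double-check the metadata fields for your installed version:

```python
import csv
import wave

import numpy as np
from deepspeech import Model

# Placeholder paths; decoder parameters are the usual 0.6.x defaults.
ds = Model("model/output_graph.pbmm", 500)
ds.enableDecoderWithLM("model/lm.binary", "model/trie", 0.75, 1.85)

with wave.open("dataset/audio/2.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

# In 0.6.x, the metadata is a flat list of per-character items,
# each with a character and a start_time in seconds.
metadata = ds.sttWithMetadata(audio)

# Bucket the characters into one row per minute of audio.
rows = {}
for i in range(metadata.num_items):
    item = metadata.items[i]
    minute = int(item.start_time // 60)
    rows.setdefault(minute, []).append(item.character)

with open("transcript.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["frame_started_time", "text_it_contains"])
    for minute in sorted(rows):
        writer.writerow([minute * 60, "".join(rows[minute]).strip()])
```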

transcribe.py does exactly what you want.

Note that it uses VAD to split the input file, so by default it will not generate chunks of a fixed duration (like one minute), but you should be able to write some logic to coalesce successive chunks in the output into whatever resolution you prefer.
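As a sketch of that coalescing step: assuming transcribe.py gives you a list of segments, each with a start time in milliseconds and a transcript (the field names and the output file name below are assumptions, so adjust them to whatever your version actually writes), you could bucket them per minute like this:

```python
import csv
import json

# Assumed shape of the transcribe.py output: a JSON list of segments
# with "start" (milliseconds) and "transcript" fields.
with open("2.wav.tlog") as f:   # placeholder output file name
    segments = json.load(f)

buckets = {}
for seg in segments:
    minute = int(seg["start"] // 60000)
    buckets.setdefault(minute, []).append(seg["transcript"])

with open("transcript_per_minute.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["frame_started_time", "text_it_contains"])
    for minute in sorted(buckets):
        writer.writerow([minute * 60, " ".join(buckets[minute])])
```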

Maybe it is a foolish question, but I want to be sure.

transcribe.py does exactly what you want.

In this case, should I clone the whole DeepSpeech project and then run transcribe.py, or is it already in my virtual env from installing via pip?
Why can’t I just grab this one .py file? Well, it references other .py files; for example, split_audio_file() comes from feeding.py, and so on.
And one more thing: are you sure it works with the pretrained model?

Why don’t I just put something together with words_from_metadata(metadata) from client.py? Will it cause any trouble?
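For what it’s worth, a helper along those lines is easy to sketch yourself from the per-character metadata. This is only an illustration of the idea against the 0.6.x metadata fields, not the actual code from client.py:

```python
def words_with_times(metadata):
    """Group per-character metadata items into (word, start_time) pairs."""
    words = []
    current_word = ""
    word_start = None
    for i in range(metadata.num_items):
        item = metadata.items[i]
        if item.character == " ":
            if current_word:
                words.append((current_word, word_start))
            current_word, word_start = "", None
        else:
            if not current_word:
                word_start = item.start_time  # seconds into the audio
            current_word += item.character
    if current_word:
        words.append((current_word, word_start))
    return words
```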