Using pre-trained model

Hi, I’m using the Mozilla DeepSpeech model for educational purposes, but I’ve run into some trouble. The model runs pretty well on the examples I found for version 0.6.0 (I’m running 0.6.1). Here is the output for one of them, and I’d say it’s correct:

(deepspeech-venv) C:\Diploma\Tech\Deepspeech>deepspeech --model model/output_graph.pbmm --lm model/lm.binary --trie model/trie --audio dataset/audio/2.wav
Loading model from file model/output_graph.pbmm
TensorFlow: v1.14.0-21-ge77504ac6b
DeepSpeech: v0.6.1-0-g3df20fe
2020-04-15 21:37:52.706554: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
Loaded model in 0.026s.
Loading language model from files model/lm.binary model/trie
Loaded language model in 0.0191s.
Running inference.
why should one halt on the way
Inference took 2.508s for 2.735s audio file.

I used the lm.binary and trie files that came with the model’s tar.gz.
The trouble is with my own audio: even for a 2-second clip saying, for example, “experience”, it struggles and doesn’t recognize it. I tried audio saying “here I am” (speech starts at 1 s); it wasn’t recognized correctly, the output was “he eleanora”. Should I build lm.binary and the other files myself?
Any help would be much appreciated!

Please document how you record those, as well as any speech-related features (your accent, etc.)

how you record those

I recorded it on my laptop (Windows 10). The format is m4a at 48 kHz; then I converted it to WAV at 16 kHz with a converter.

your accent

I’m from Central Asia.

any speech-related feature

There is no noise to speak of; I recorded the example alone in my room.

Do you need an example uploaded?

So that’s one big source of risk. Can you ensure it fits the requirements of the network: WAV, PCM 16-bit, 16 kHz, mono?
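If it helps, here is a minimal check using Python’s standard wave module to verify a file against those requirements (the path is just a placeholder, and the wave module only opens uncompressed PCM WAV files):

```python
import wave

def check_wav(path):
    """Check that a WAV file is PCM, 16-bit, 16 kHz, mono."""
    with wave.open(path, "rb") as w:
        ok = (
            w.getnchannels() == 1          # mono
            and w.getsampwidth() == 2      # 16-bit samples
            and w.getframerate() == 16000  # 16 kHz
        )
        print(path,
              "channels:", w.getnchannels(),
              "sample width:", w.getsampwidth() * 8, "bit",
              "rate:", w.getframerate(), "Hz",
              "-> OK" if ok else "-> needs conversion")
        return ok

check_wav("dataset/audio/2.wav")
```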

English spoken with a Central Asian accent? Sadly, this might be a big source of confusion for the network. Accent diversity is a big problem; Common Voice is one part of the solution.

If you can, try to mimic a US English accent as much as possible; it might help.

Turns out that I didn’t convert to mono, and now recognition works better. I added a photo of how I now convert my audio.
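For anyone without the screenshot, an equivalent conversion can be scripted. This is just a minimal sketch, assuming pydub with ffmpeg on the PATH and placeholder file names:

```python
from pydub import AudioSegment

# Convert the original m4a recording to what DeepSpeech expects:
# WAV, PCM, 16 kHz, 16-bit, mono. File names are placeholders.
audio = AudioSegment.from_file("recording.m4a")
audio = (
    audio.set_frame_rate(16000)  # resample to 16 kHz
         .set_channels(1)        # downmix to mono
         .set_sample_width(2)    # 16-bit samples
)
audio.export("recording_16k_mono.wav", format="wav")
```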


Thank you a lot :slight_smile:
Since it now works well but still recognizes some words incorrectly, is there anything to be done about it?
My “experience” was recognized as “the sperience”, and a sentence saying “here I am and I want to recognize my speech via deep speech recognition technology” came out as “here i am and i want to present night my speech the deep speech recognition to foote”.

This is very likely just your accent at that point; I get similar results on the released English model with my French accent.

Yeah, I’d use it if my target audio were similar to my examples, but it is not. There is some noise, not much, but it still exists. Anyway, thank you!

Some noise might not seem that big but still be enough to mess with the transcription. This is something we are trying to address via Common Voice and noise augmentation.

Hi, I have a new question here. I wonder if there is a way to write the recognized text to a .csv file and, at the same time, break it into several rows, each row representing one minute of audio. In other words, I’d be glad to know if there is a way to get an output CSV file with two columns, “frame_started_time” and “text_it_contains”, while inferencing.
If it is not possible with the pretrained models, I’ll go another way. Thanks in advance!

I am not quite sure that I get what you want. You could get the metadata from the native client with the extended flag and write this info yourself:
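The CLI’s extended output comes from sttWithMetadata, so you can call it directly from Python and write the CSV yourself. This is only a sketch against the 0.6.x API, with placeholder paths and the usual default decoder parameters; double-check the metadata fields for your installed version:

```python
import csv
import wave

import numpy as np
from deepspeech import Model

# Placeholder paths; decoder parameters are the usual 0.6.x defaults.
ds = Model("model/output_graph.pbmm", 500)
ds.enableDecoderWithLM("model/lm.binary", "model/trie", 0.75, 1.85)

with wave.open("dataset/audio/2.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

# In 0.6.x, the metadata is a flat list of per-character items,
# each with a character and a start_time in seconds.
metadata = ds.sttWithMetadata(audio)

# Bucket the characters into one row per minute of audio.
rows = {}
for i in range(metadata.num_items):
    item = metadata.items[i]
    minute = int(item.start_time // 60)
    rows.setdefault(minute, []).append(item.character)

with open("transcript.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["frame_started_time", "text_it_contains"])
    for minute in sorted(rows):
        writer.writerow([minute * 60, "".join(rows[minute]).strip()])
```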

transcribe.py does exactly what you want.

Note that it uses VAD to split the input file, so by default it will not generate chunks of a fixed duration (like one minute), but you should be able to write some logic to coalesce successive chunks in the output into whatever resolution you prefer.
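As a sketch of that coalescing step: assuming transcribe.py gives you a list of segments, each with a start time in milliseconds and a transcript (the field names and the output file name below are assumptions, so adjust them to whatever your version actually writes), you could bucket them per minute like this:

```python
import csv
import json

# Assumed shape of the transcribe.py output: a JSON list of segments
# with "start" (milliseconds) and "transcript" fields.
with open("2.wav.tlog") as f:   # placeholder output file name
    segments = json.load(f)

buckets = {}
for seg in segments:
    minute = int(seg["start"] // 60000)
    buckets.setdefault(minute, []).append(seg["transcript"])

with open("transcript_per_minute.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["frame_started_time", "text_it_contains"])
    for minute in sorted(buckets):
        writer.writerow([minute * 60, " ".join(buckets[minute])])
```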

Maybe it is a foolish question, but I want to be sure.

transcribe.py does exactly what you want.

In this case, should I clone the whole DeepSpeech project and then run transcribe.py, or is it already in my virtual env from installing via pip?
Why can’t I just grab this one .py file? Well, it references other .py files; for example, split_audio_file() comes from feeding.py, and so on.
And one more thing: are you sure it works with the pretrained model?

Why don’t I just put something together with words_from_metadata(metadata) from client.py? Will it cause any trouble?
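For what it’s worth, a helper along those lines is easy to sketch yourself from the per-character metadata. This is only an illustration of the idea against the 0.6.x metadata fields, not the actual code from client.py:

```python
def words_with_times(metadata):
    """Group per-character metadata items into (word, start_time) pairs."""
    words = []
    current_word = ""
    word_start = None
    for i in range(metadata.num_items):
        item = metadata.items[i]
        if item.character == " ":
            if current_word:
                words.append((current_word, word_start))
            current_word, word_start = "", None
        else:
            if not current_word:
                word_start = item.start_time  # seconds into the audio
            current_word += item.character
    if current_word:
        words.append((current_word, word_start))
    return words
```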