Hi, I’m using the Mozilla/DeepSpeech model for educational purposes, but I’ve run into some trouble. The model runs pretty well on examples I found for version 0.6.0 (I’m running 0.6.1). Here is the output for one of them, and I’d say it’s correct:
(deepspeech-venv) C:\Diploma\Tech\Deepspeech>deepspeech --model model/output_graph.pbmm --lm model/lm.binary --trie model/trie --audio dataset/audio/2.wav
Loading model from file model/output_graph.pbmm
TensorFlow: v1.14.0-21-ge77504ac6b
DeepSpeech: v0.6.1-0-g3df20fe
2020-04-15 21:37:52.706554: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
Loaded model in 0.026s.
Loading language model from files model/lm.binary model/trie
Loaded language model in 0.0191s.
Running inference.
why should one halt on the way
Inference took 2.508s for 2.735s audio file.
I used the lm.binary and trie files that came with the model’s tar.gz.
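For reference, the same inference can also be run from Python instead of the CLI. A minimal sketch, assuming the deepspeech 0.6.x pip package and the default beam width and LM hyperparameters from that release (paths match the CLI call above):

```python
import wave
import numpy as np
from deepspeech import Model

# Defaults as shipped with the 0.6.x release (assumption).
BEAM_WIDTH = 500
LM_ALPHA = 0.75
LM_BETA = 1.85

ds = Model('model/output_graph.pbmm', BEAM_WIDTH)
ds.enableDecoderWithLM('model/lm.binary', 'model/trie', LM_ALPHA, LM_BETA)

# Audio must be 16-bit PCM at the model's sample rate (16 kHz).
with wave.open('dataset/audio/2.wav', 'rb') as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(ds.stt(audio))
```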
The trouble is with my own audio: even on a 2-second clip saying, for example, “experience”, it struggles and doesn’t recognize it. I tried audio saying “here I am” (speech starts at 1 s); it wasn’t recognized correctly, and the output was “he eleanora”. Should I build the lm.binary and other files myself?
Any help would be much appreciated!
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
Please document how you record those, as well as any speech-related feature (your accent, etc.)
I recorded it on my laptop (Windows 10). The format is m4a at 48 kHz, then
I converted it to WAV at 16 kHz with a converter.
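(For anyone following along, a conversion like that can be scripted with ffmpeg, forcing exactly the format DeepSpeech expects. A sketch, assuming ffmpeg is on PATH; the filenames are placeholders:)

```python
import subprocess

# Convert m4a (48 kHz) to WAV, 16-bit signed PCM, 16 kHz, mono.
subprocess.run([
    "ffmpeg", "-i", "input.m4a",  # source file (placeholder name)
    "-ar", "16000",               # resample to 16 kHz
    "-ac", "1",                   # downmix to mono
    "-acodec", "pcm_s16le",       # 16-bit signed little-endian PCM
    "output.wav",
], check=True)
```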
your accent
I’m from Central Asia.
any speech-related feature
There is no noise to speak of; I recorded the example alone in my room.
Do you need an example uploaded?
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
So that’s one big source of risk. Can you ensure it fits the requirements of the network: WAV, PCM 16-bit, 16 kHz, mono?
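(A quick way to check those properties is to read the WAV header. A sketch using Python’s standard-library wave module; the filename is a placeholder:)

```python
import wave

# Inspect the WAV header to confirm it matches the model's input format.
with wave.open("output.wav", "rb") as w:
    assert w.getnchannels() == 1, "expected mono"
    assert w.getsampwidth() == 2, "expected 16-bit samples (2 bytes)"
    assert w.getframerate() == 16000, "expected 16 kHz sample rate"
print("WAV format matches: PCM 16-bit, 16 kHz, mono")
```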
English spoken with a Central Asian accent? Sadly, this might be a big source of confusion for the network. Accent diversity is a big problem; Common Voice is one part of the solution.
If you can, try to mimic a US English accent as much as possible; it might help.
Thank you a lot!
It worked well after that, but it still recognizes some words incorrectly; is there anything to be done about it?
It recognized my “experience” as “the sperience”, and the sentence “here I am and I want to recognize my speech via deep speech recognition technology” as “here i am and i want to present night my speech the deep speech recognition to foote”.
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
At that point this is very likely just your accent; I get similar results on the released English model with my French accent.
Yeah, I’d use it if my target audio were similar to my examples, but it is not. There is some noise, not much, but it’s still there. Anyway, thank you!
lissyx
((slow to reply) [NOT PROVIDING SUPPORT])
Some noise might not seem like much but can still be enough to mess with the transcription. This is something we are trying to address via Common Voice and noise augmentation.
Hi, I have a new question. I wonder if there is a way to write the recognized text to a .csv file and at the same time break it apart into several rows, one per one-minute frame. In other words, I’d be glad to know if there is a way to get an output CSV file with two columns, “frame_started_time” and “text_it_contains”, while running inference.
If it isn’t possible with the pretrained models, I’ll go another way. Thanks in advance!
Note that transcribe.py uses VAD to split the input file, so by default it will not generate chunks of a fixed duration (like one minute), but you should be able to write some logic to coalesce successive chunks in the output into whatever resolution you prefer, as in the sketch below.
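(Something like the following could do that coalescing. A sketch, assuming you already have an ordered list of (start_time_seconds, text) pairs from the transcription output; the function name and sample data are hypothetical:)

```python
import csv

def write_minute_csv(chunks, out_path):
    """Group (start_time_seconds, text) chunks into one-minute CSV rows."""
    rows = {}  # minute index -> list of chunk texts
    for start, text in chunks:
        rows.setdefault(int(start // 60), []).append(text)

    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["frame_started_time", "text_it_contains"])
        for minute in sorted(rows):
            writer.writerow([minute * 60, " ".join(rows[minute])])

# Hypothetical example: three VAD chunks spanning the first two minutes.
write_minute_csv([(0.0, "hello"), (42.5, "world"), (75.2, "again")], "out.csv")
```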
In this case, should I clone the whole DeepSpeech project and then run transcribe.py, or is it already in my virtual env from when I installed it with pip?
Why don’t I just grab that single .py file? Well, it references other .py files; for example, split_audio_file() refers to feeding.py, and so on.
And one more thing: are you sure it works with the pretrained model?