I recently tried out DeepSpeech, following the instructions from the docs:
I installed it via pip and downloaded the pre-trained English models. My source was a video of a press conference, which I first converted to a (hopefully) fitting wav file via ffmpeg:
ffmpeg -i [videofile] -acodec pcm_u8 -ar 16000 out.wav
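In case it helps diagnose things, here is how one could double-check what that conversion actually produced, using Python's stdlib wave module (a sketch only: since I can't attach out.wav, it writes a tiny placeholder file with the same parameters as my ffmpeg command and then reads its header back):

```python
import wave

# Hypothetical stand-in for out.wav: one second of silence with the
# parameters from the ffmpeg command above (pcm_u8, mono, 16 kHz).
with wave.open("check.wav", "wb") as w:
    w.setnchannels(1)            # mono
    w.setsampwidth(1)            # pcm_u8 -> 1 byte per sample
    w.setframerate(16000)        # -ar 16000
    w.writeframes(bytes(16000))  # 16000 frames of silence

# Read the header back to see what a consumer of the file would see.
with wave.open("check.wav", "rb") as w:
    print(w.getnchannels(), w.getsampwidth(), w.getframerate())
    # -> 1 1 16000  (sample width 1 byte, i.e. 8-bit audio)
```

The same read-back check on the real out.wav would show whether the file ended up with the channel count, sample width, and sample rate one intended.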
The video was around 1:15 h long, mostly spoken English without much background noise, so I'd expect it to produce reasonable output. The transcription process ran for around 25 minutes.
The output looked like this:
“entertainments internationalisation teetotallers teetotallers teetotalers teetotallers oesterreichischer disconsolately specialisation inaccessibleness teetotallers teetotalers teetotallers teetotallers teetotallers teetotallers secessionists etiennette itineraries teetotalers etiennette […]”
So it was just meaningless gibberish (the word "teetotallers" appeared a lot, for whatever reason). The whole output was only ~5000 bytes; for over an hour of spoken language one would expect much, much more text.
I suspect I made some fundamental mistake somewhere that produced this useless output, but I have no idea where. Any pointers?