Only gibberish output

I recently tried out DeepSpeech, following the instructions in the docs:
https://deepspeech.readthedocs.io/en/r0.9/

I installed it via pip and downloaded the pre-trained English models. My source was a video of a press conference, which I first converted to a WAV file with ffmpeg:
ffmpeg -i [videofile] -acodec pcm_u8 -ar 16000 out.wav
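
In case it helps with diagnosing this, here is a minimal sketch for checking what format the converted file actually has, using Python's standard wave module. The expected values (16 kHz, mono, 16-bit samples) are an assumption based on my reading of the docs, and the helper names describe_wav and format_mismatches are made up for illustration.

```python
import wave

# Assumed input format for the pre-trained English model:
# 16 kHz, mono, 16-bit samples. This is an assumption, not something verified here.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "sample_rate_hz": 16000}


def describe_wav(path):
    """Return the basic format parameters of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }


def format_mismatches(info, expected=EXPECTED):
    """List the parameters that differ from the expected format."""
    return [
        f"{key}: got {info[key]}, expected {value}"
        for key, value in expected.items()
        if info[key] != value
    ]
```

Calling format_mismatches(describe_wav("out.wav")) returns an empty list if the file matches the assumed format, and otherwise one line per differing parameter.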

The video was around 1 hour 15 minutes long, mostly spoken English with little background noise, so I'd expect it to produce reasonable output. The transcription ran for around 25 minutes.

The output looked like this:
“entertainments internationalisation teetotallers teetotallers teetotalers teetotallers oesterreichischer disconsolately specialisation inaccessibleness teetotallers teetotalers teetotallers teetotallers teetotallers teetotallers secessionists etiennette itineraries teetotalers etiennette […]”

So it was just meaningless gibberish (the word “teetotallers” appeared a lot, for whatever reason). The whole output was only ~5000 bytes; for more than an hour of speech I would expect far more text.

I think I made some fundamental mistake somewhere that produced useless output, but I have no idea where. Any pointers?
