Standard Method for Processing Long Audio Files with 0.3.0/0.4.0 Python Package?

Here’s a snippet of my processing code:

from deepspeech import Model
import scipy.io.wavfile as wav


# Model(graph_path, n_features, n_context, alphabet_path, beam_width)
ds = Model('models/output_graph.pb', 26, 9, 'models/alphabet.txt', 500)


def process(path, lecture_id):
    fs, audio = wav.read(path)  # expects 16-bit PCM, mono, 16 kHz audio
    processed_data = ds.stt(audio, fs)
    with open('lectures/' + lecture_id + '.txt', 'a') as f:
        f.write(processed_data)
    return processed_data

Sorry, this does not answer my question.

Whoops, I added my graph instantiation above.

Ok, so you are loading a .pb, not a .pbmm, which means huge memory usage and is likely not helping in your case. Please check the documentation about the mmap file format.
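For anyone following along, the .pb graph can be converted to the mmap-friendly .pbmm format with TensorFlow’s convert_graphdef_memmapped_format tool. This is a sketch; the tool’s location depends on your TensorFlow build, and the paths are placeholders:

```shell
# Convert the protobuf graph into a memory-mapped .pbmm
# (paths are placeholders; adjust to your setup).
convert_graphdef_memmapped_format \
  --in_graph=models/output_graph.pb \
  --out_graph=models/output_graph.pbmm
```

Then point the Model() constructor at output_graph.pbmm instead of output_graph.pb.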

Thanks! I’ll try that change and report back.

This worked! However, I got this result from my short audio file:

“een go at in e an e same is were er to o in tejest bused e bro o or litre an per andifolworesware o t al e o as aner reaerything is bid some min o so o o o e o la oro ah e or ee ingle is ateou ea to his head e ant oo the bot te o hii a se is bo i weo reb e be a arebut the a e o a o bo tha wy om back tothe oher”

Using the release 0.3.0 models and packages I’m getting gibberish. Though my input has noise and reverb, I’m surprised by the lack of English words. Is this to be expected?

It really depends on your audio sources, at some point. Can you make sure it’s 16-bit PCM, 16 kHz and mono, first? Could you share a sample?
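One quick way to verify those properties is with the standard library’s wave module; check_wav below is just an illustrative helper name, not part of DeepSpeech:

```python
import wave

def check_wav(path):
    """Return (sample_rate, channels, sample_width_bytes) and warn when the
    file is not the 16 kHz / mono / 16-bit PCM layout DeepSpeech expects."""
    with wave.open(path, 'rb') as w:
        rate = w.getframerate()
        channels = w.getnchannels()
        width = w.getsampwidth()
    if (rate, channels, width) != (16000, 1, 2):
        print('warning: %s is %d Hz, %d channel(s), %d-bit'
              % (path, rate, channels, 8 * width))
    return rate, channels, width
```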

Here is the snippet: https://drive.google.com/file/d/1guFjgkmwJbi_e5nsWuZeMY5njj9klTGj/view?usp=drivesdk

I used Audacity to convert it to 16 kHz mono WAV (16-bit PCM).
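For reference, the same conversion can be done from the command line; this assumes ffmpeg is installed, and the file names are placeholders:

```shell
# Resample to 16 kHz, downmix to mono, and write 16-bit PCM WAV
# (input.m4a / output.wav are placeholder names).
ffmpeg -i input.m4a -ar 16000 -ac 1 -acodec pcm_s16le output.wav
```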

What was the source?

Source was a 44 kHz mp4.

Mono or stereo? I assume it was mp3. It’s possible your conversion introduced some bad artifacts, but given the output, can we be sure of your exact setup / versions? The full output should include that information.

Sorry, the source was a stereo .m4a. I’m running the latest non-alpha (0.3.0) versions of the Python package and models. As the Python code only returns a string with the transcription, I’m not sure what “full output” I can provide. If you mean the CLI interface, I don’t have access to that, as I’m only able to run the program in a specific environment.

libdeepspeech.so prints some TensorFlow/DeepSpeech version info on stderr. We need that.

I’ll look into that. Here is the link to the models I’m downloading in my Dockerfile: https://github.com/mozilla/DeepSpeech/releases/download/v0.3.0/deepspeech-0.3.0-models.tar.gz

Except that the version you are running with is really important. …

Alright, so we’re talking about the output after running ./deepspeech-0.3.0-models/libdeepspeech.so?

Sorry, I haven’t been able to find that file. It seems to be related to the Python package in some way (I found mentions of Native Client in the git repo)?

We package that inside the Python wheel, but you cannot directly execute it.

@zaptrem Any call to DS_CreateModel() will print version information on stderr: https://github.com/mozilla/DeepSpeech/blob/master/native_client/deepspeech.cc#L356. So you should be able to get it, even with the snippet you posted earlier.
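If it helps, that stderr output can be captured from Python with subprocess. The inline "-c" program below only simulates the version line libdeepspeech.so would print; substitute your actual transcription script:

```python
import subprocess
import sys

# Run a child Python process and capture its stderr. The inline "-c"
# program stands in for your real script; anything the library writes
# to stderr (e.g. the TensorFlow/DeepSpeech version lines printed by
# DS_CreateModel()) ends up in result.stderr.
result = subprocess.run(
    [sys.executable, '-c',
     "import sys; sys.stderr.write('TensorFlow: v1.11.0\\n')"],
    stderr=subprocess.PIPE,
)
print(result.stderr.decode())
```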

Bit of a delay, but here you go:

TensorFlow: v1.11.0-9-g97d851f
DeepSpeech: v0.3.0-0-gef6b5bd