Standard Method for Processing Long Audio Files with 0.3.0/0.4.0 Python Package?

I’ve been working on a small project involving Python and long audio files, but I’ve been unable to track down Python-package-specific documentation (à la https://pillow.readthedocs.io/en/5.3.x/ ) covering non-streaming use cases. Is there a standard method/class/process for processing larger audio files that doesn’t involve slicing them up into smaller ones?

I read a few articles mentioning the split-and-process approach, but that’s difficult for my use case, and the articles are months/years old and don’t take into account this development, which mentions more efficient processing of longer files.

Any help you can provide would be much appreciated!

Bump (is that allowed here?)

Can you quantify “larger audio files”? How much are you talking about? Have you tested and gotten bad results?

My demo was a 30-60 second clip (part of a 1-hour clip I planned on eventually converting). It ran for about 30 minutes before I cut it off. I can make up a snippet similar to my code if you think it would be useful.

Wait, a 30-60 second file was not decoded after 30 minutes? Can you share details on the hardware?

A generic Google Cloud Compute Engine instance with 3.75 GB of RAM and 2 Haswell cores.

Did you use the mmap()-capable file, output_graph.pbmm? If not, you might have been limited by memory.

Here’s a snippet of my processing code:

from deepspeech import Model
import scipy.io.wavfile as wav


# 0.3.0 API: Model(graph_path, n_features, n_context, alphabet_path, beam_width)
ds = Model('models/output_graph.pb', 26, 9, 'models/alphabet.txt', 500)


def process(path, lecture_id):
    # wav.read() returns (sample_rate, samples)
    fs, audio = wav.read(path)
    # Run inference on the whole buffer at once
    processed_data = ds.stt(audio, fs)
    # Append the transcript to a per-lecture text file
    with open('lectures/' + lecture_id + '.txt', 'a') as f:
        f.write(processed_data)
    return processed_data

Sorry, this does not answer my question.

Whoops, I added my graph instantiation above.

OK, so .pb, not .pbmm: that means huge memory usage, which is likely not helping in your case. Please check the documentation about the mmap file format.
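If the 0.3.0 model download includes an output_graph.pbmm next to the .pb (check your models/ directory), the only change should be the path you pass to Model, keeping the same constants as in your snippet. A minimal sketch:

from deepspeech import Model

# Same 0.3.0 constructor arguments as in the snippet above; only the graph
# file changes. The .pbmm variant is memory-mapped by TensorFlow instead of
# being loaded entirely into RAM.
ds = Model('models/output_graph.pbmm', 26, 9, 'models/alphabet.txt', 500)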

Thanks! I’ll try that change and report back.

This worked! However, I got this result from my short audio file:

“een go at in e an e same is were er to o in tejest bused e bro o or litre an per andifolworesware o t al e o as aner reaerything is bid some min o so o o o e o la oro ah e or ee ingle is ateou ea to his head e ant oo the bot te o hii a se is bo i weo reb e be a arebut the a e o a o bo tha wy om back tothe oher”

Using the release 0.3.0 models and packages, I’m getting gibberish. Though my input has noise and reverb, I’m surprised by the lack of English words. Is this to be expected?

It really depends on your audio sources, at some point. Can you make sure it’s 16-bit PCM, 16 kHz, and mono, first? Could you share a sample?
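For reference, a quick way to check those properties from Python before calling stt() (the file path here is just a placeholder):

import wave

# DeepSpeech 0.3.0 models expect 16 kHz, mono, 16-bit PCM WAV input.
with wave.open('lectures/sample.wav', 'rb') as w:
    print('channels:    ', w.getnchannels())   # should be 1 (mono)
    print('sample rate: ', w.getframerate())   # should be 16000
    print('sample width:', w.getsampwidth())   # should be 2 bytes (16-bit)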

Here is the snippet: https://drive.google.com/file/d/1guFjgkmwJbi_e5nsWuZeMY5njj9klTGj/view?usp=drivesdk

I used Audacity to convert it to 16 kHz mono WAV (Windows 16-bit PCM).

What was the source?

The source was a 44 kHz mp4.

Mono or stereo? I assume it was MP3. It’s possible your conversion introduced some bad artifacts, but given the output, can we be sure of your exact setup / versions? The full output should include that information.

Sorry, the source was stereo .m4a. I’m running the latest non-alpha (0.3.0) versions of the Python package and models. As the Python code only returns a string with the transcription, I’m not sure what “full output” I can provide. If you mean the CLI interface, I don’t have access to that, as I’m only able to run the program in a specific environment.

libdeepspeech.so prints some TensorFlow/DeepSpeech version info on stderr. We need that.
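If you can’t redirect stderr in the shell in that environment, one way to capture it from Python is something like this ('process_lecture.py' is just a placeholder for whatever script wraps the snippet above):

import subprocess

# Run the transcription script and capture the version banner that
# libdeepspeech.so writes to stderr when the Model is loaded.
result = subprocess.run(
    ['python', 'process_lecture.py'],
    stderr=subprocess.PIPE,
    universal_newlines=True,
)
print(result.stderr)  # TensorFlow / DeepSpeech version lines end up here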