Standard Method for Processing Long Audio Files with 0.3.0/0.4.0 Python Package?

I’ve been working on a small project involving Python and long audio files, but I’ve been unable to track down Python-package-specific documentation (à la https://pillow.readthedocs.io/en/5.3.x/ ) covering non-streaming use cases. Is there a standard method/class/process for processing larger audio files that doesn’t involve slicing them up into smaller ones?

I read a few articles mentioning the split-and-process approach, but that’s difficult for my use case, and the articles are months/years old and don’t take into account this development, which mentions more efficient processing of longer files.

Any help you can provide would be much appreciated!

Bump (is that allowed here?)

Can you quantify “larger audio files”? How much are you talking about? Have you tested and gotten bad results?

My demo was a 30-60 second clip (part of a 1-hour clip I planned on eventually converting). It ran for about 30 minutes before I cut it off. I can make up a snippet similar to my code if you think it would be useful.

Wait, a 30-60 second file was not decoded after 30 minutes? Can you share details on the hardware?

A generic Google Cloud Compute Engine instance with 3.75 GB of RAM and 2 Haswell cores.

Did you use the mmap()-capable file, output_graph.pbmm? If not, you might have been limited by memory.

Here’s a snippet of my processing code:

from deepspeech import Model
import scipy.io.wavfile as wav


# 0.3.0 API: Model(graph_path, n_features, n_context, alphabet_path, beam_width)
ds = Model('models/output_graph.pb', 26, 9, 'models/alphabet.txt', 500)


def process(path, lecture_id):
    # wav.read() returns (sample_rate, samples)
    fs, audio = wav.read(path)
    # Run inference on the whole buffer at once
    processed_data = ds.stt(audio, fs)
    # Append the transcript to a per-lecture text file
    with open('lectures/' + lecture_id + '.txt', 'a') as f:
        f.write(processed_data)
    return processed_data

Sorry, this does not answer my question.

Whoops, I added my graph instantiation above.

OK, so .pb, not .pbmm: that means huge memory usage, which is likely not helping in your case. Please check the documentation about the mmap file format.
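If the 0.3.0 model download includes an output_graph.pbmm next to the .pb (check your models/ directory), the only change should be the path you pass to Model, keeping the same constants as in your snippet. A minimal sketch:

from deepspeech import Model

# Same 0.3.0 constructor arguments as in the snippet above; only the graph
# file changes. The .pbmm variant is memory-mapped by TensorFlow instead of
# being loaded entirely into RAM.
ds = Model('models/output_graph.pbmm', 26, 9, 'models/alphabet.txt', 500)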

Thanks! I’ll try that change and report back.

This worked! However, I got this result from my short audio file:

“een go at in e an e same is were er to o in tejest bused e bro o or litre an per andifolworesware o t al e o as aner reaerything is bid some min o so o o o e o la oro ah e or ee ingle is ateou ea to his head e ant oo the bot te o hii a se is bo i weo reb e be a arebut the a e o a o bo tha wy om back tothe oher”

Using the release 0.3.0 models and packages, I’m getting gibberish. Though my input has noise and reverb, I’m surprised by the lack of English words. Is this to be expected?

It really depends on your audio sources, at some point. Can you make sure it’s 16-bit PCM, 16 kHz, and mono, first? Could you share a sample?
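For reference, a quick way to check those properties from Python before calling stt() (the file path here is just a placeholder):

import wave

# DeepSpeech 0.3.0 models expect 16 kHz, mono, 16-bit PCM WAV input.
with wave.open('lectures/sample.wav', 'rb') as w:
    print('channels:    ', w.getnchannels())   # should be 1 (mono)
    print('sample rate: ', w.getframerate())   # should be 16000
    print('sample width:', w.getsampwidth())   # should be 2 bytes (16-bit)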

Here is the snippet: https://drive.google.com/file/d/1guFjgkmwJbi_e5nsWuZeMY5njj9klTGj/view?usp=drivesdk

I used Audacity to convert it to 16 kHz mono WAV (Windows 16-bit PCM).

What was the source?

The source was a 44 kHz mp4.

Mono or stereo? I assume it was MP3. It’s possible your conversion introduced some bad artifacts, but given the output, can we be sure of your exact setup / versions? The full output should include that information.

Sorry, the source was stereo .m4a. I’m running the latest non-alpha (0.3.0) versions of the Python package and models. As the Python code only returns a string with the transcription, I’m not sure what “full output” I can provide. If you mean the CLI interface, I don’t have access to that, as I’m only able to run the program in a specific environment.

libdeepspeech.so prints some TensorFlow/DeepSpeech version info on stderr. We need that.
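If you can’t redirect stderr in the shell in that environment, one way to capture it from Python is something like this ('process_lecture.py' is just a placeholder for whatever script wraps the snippet above):

import subprocess

# Run the transcription script and capture the version banner that
# libdeepspeech.so writes to stderr when the Model is loaded.
result = subprocess.run(
    ['python', 'process_lecture.py'],
    stderr=subprocess.PIPE,
    universal_newlines=True,
)
print(result.stderr)  # TensorFlow / DeepSpeech version lines end up here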