I am trying to run the DeepSpeech 0.7.0 inference engine on long audio files, 30-45 minutes or even an hour long. The audio was sampled at 44.1 kHz, and I see that client.convert_samplerate uses the SoX library to downsample it to 16 kHz. The recordings are standard American-accented speech with 2 speakers. The output I get is way off, very often gibberish, and nowhere near the accuracy I see on the LibriSpeech test-clean set.
I tried clipping this down to a 30-second segment and the transcription only got marginally better.
Is there a suggested approach to address this, ideally without retraining? Is retraining with longer audio clips even a solution here? I assume that would be extremely slow even if I have the hardware to support it (and my batch size would have to be small).
Also, is there a difference between running from the 0.7.0 checkpoint files and the 0.7.0 .pbmm model file?
Thanks
For what it's worth, you can use the transcribe.py utility to quickly transcribe longer audio files with GPU batching. It is unlikely that your results have anything to do with the length of the file; most likely the accent or recording characteristics are what the model handles poorly, or some other step is messed up, like the downsampling, somehow.
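For reference, an invocation looks roughly like this (the flag names are from memory, so treat them as an assumption and check python transcribe.py --helpfull for the exact ones):

# sketch only: transcribe a long recording using the 0.7.0 checkpoint (flag names are assumptions)
python transcribe.py \
  --src /path/to/long_recording.wav \
  --dst /path/to/long_recording.tlog \
  --checkpoint_dir /path/to/deepspeech-0.7.0-checkpoint \
  --scorer /path/to/deepspeech-0.7.0-models.scorer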
What do you mean “running from the checkpoint files”? You mean with --one_shot_infer? If so, the biggest difference is that --one_shot_infer does not do resampling at all.
I meant specifying the --checkpoint_dir param vs the --model param when using evaluate.py or transcribe.py, or running the DeepSpeech Python bindings with the 0.7.0 .pbmm model. Should I expect the same transcription results whether I load the 0.7.0 checkpoint or the 0.7.0 .pbmm model file?
I tried another test where I combined 10 of the longest files from LibriSpeech test-clean into a 5-minute wav (my merge tool ended up converting this to mp3 and upsampling it to 48 kHz). This longer, upsampled LibriSpeech test data produced the expected transcriptions.
To provide some background on my test files: they are meeting recordings with 2 or more people, at least 30 minutes long in most cases, so there are occasional situations where people talk over each other. Is this something that should already be handled because the model was trained on the Switchboard dataset? (Sorry, I don't have access to this dataset; since those are telephone conversations, I am assuming there is some amount of people talking over each other.)
I tried transcribe.py and it certainly is pretty fast compared to running the Python bindings, and the results are slightly better. Do you suggest downsampling with any library other than SoX? As far as the accent goes, it is standard American, but there is more voice modulation than in LibriSpeech since it is conversational. Do you have any suggestions on how this can be improved? Would you recommend fine tuning the model as documented here?
There is no tool that can take both types of input, so I’m not sure what you’re talking about. To answer the question in general, the results should be equivalent.
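To spell that out (paths and file names below are just placeholders): the checkpoint directory is consumed by the training-side tools, while the exported .pbmm goes to the client or the bindings, so you run one or the other:

# checkpoint files go with the training tools
python evaluate.py --test_files my_test.csv --checkpoint_dir /path/to/deepspeech-0.7.0-checkpoint
# the exported .pbmm goes with the client / Python bindings
deepspeech --model deepspeech-0.7.0-models.pbmm --scorer deepspeech-0.7.0-models.scorer --audio my_audio_16k.wav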
The assumption is incorrect: Switchboard has dedicated channels for each speaker, so there is no overlap.
Nice!
SoX should be fine, just follow the flags used by our client. Some flags can affect the results somewhat, like dithering.
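Something like this should be roughly equivalent to what the client does internally (a sketch only; check convert_samplerate in client.py for the exact flags it passes):

# downsample to 16 kHz, mono, 16-bit signed PCM, with dithering disabled (mirrors the client's flags)
sox input_44k.wav --bits 16 --channels 1 --rate 16000 --encoding signed-integer --no-dither output_16k.wav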
You can try fine tuning, just be mindful that it is a research task, not something you just hit a button and wait for it to be done. You’ll have to explore the parameters that work best for your data.
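To give an idea of the shape of such a run, here is a sketch along the lines of the fine-tuning example in the training docs; the hyperparameter values are placeholders you would have to tune for your data, not recommendations:

# continue training from the released 0.7.0 checkpoint on your own data (values are placeholders)
python DeepSpeech.py \
  --n_hidden 2048 \
  --checkpoint_dir /path/to/deepspeech-0.7.0-checkpoint \
  --train_files my_train.csv \
  --dev_files my_dev.csv \
  --test_files my_test.csv \
  --epochs 3 \
  --train_batch_size 8 \
  --learning_rate 0.0001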
An easier possibility might be making your own language model, especially if the recorded conversations are about a specific subject matter, with jargon or unusual words and sentences. It is a faster experiment than fine tuning. There is some documentation on how to do that here: https://github.com/mozilla/DeepSpeech/tree/master/data/lm
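The rough shape of it, going by the data/lm README (the values below are only illustrative; double-check the README for the parameters actually used for the release scorer):

# build an ARPA and binary KenLM model from your own text corpus (illustrative values)
python generate_lm.py \
  --input_txt my_corpus.txt.gz \
  --output_dir . \
  --top_k 500000 \
  --kenlm_bins /path/to/kenlm/build/bin/ \
  --arpa_order 5 \
  --max_arpa_memory "85%" \
  --arpa_prune "0|0|1" \
  --binary_a_bits 255 \
  --binary_q_bits 8 \
  --binary_type trie
# afterwards, package lm.binary plus the vocabulary file into a .scorer with the packaging
# script in the same directory (see the README for the exact command and the alpha/beta defaults)

Roughly speaking, --top_k controls how many distinct words are kept and --arpa_prune how aggressively rare n-grams are dropped; keeping more improves coverage at the cost of a larger model.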
Is there any limitation on the size of the language model? Would it make sense to create a txt file with the entire Wikipedia text and use that as input to the generate_lm.py script?
Is there a rule of thumb for how many sentences to add per word? For example, if I want the model to recognize company names like Verizon, Airbnb, etc., how many sentences should I add to help it transcribe these words while not disturbing what it already does right? Will adding more sentences increase the WER?
Thanks
The current language model includes Wikipedia as far as I know, but the more data in it the merrier.
Ideally you have many sentences for every word.
WER will only improve if you add the right words; only words that are in the language model can be found.
And the standard parameters reduce the generated model a bit; if you compress/prune less, you'll detect more, but you'll need more memory. Always a compromise …
Thanks for the reply, @othiele
Please correct me if I am wrong here; I was under the assumption that the LM binary was generated from the LibriSpeech normalized LM training text, available here. Is that not the case? I did not see any references to Wikipedia normalized text being used, though I did notice some posts from people trying to use Wikipedia text. To me the confusion is that a lot of the sentences in librispeech-lm-norm.txt don't make real sense, and yet the model seems to be working well on inference for the most part.