I recently started using DeepSpeech for audio transcription, and I have a few doubts to clarify.
Here are the specifications:
Training or Inference - Both
DeepSpeech branch/version - 0.7.4
OS Platform and Distribution (e.g., Linux Ubuntu 18.04) - Linux - Ubuntu 18.04
Python version - 3.6.9
TensorFlow version - tensorflow-gpu==1.15.2
Inference:
1. What are the specifications to keep in mind for the audio file used for inference? I ask because I have files in different formats (.mp3, .wav, etc.).
2. Is there any restriction on the length of the audio (in minutes)?
Training:
I have started training a custom model. I want to understand the basic specifications for the training data.
I understand it should be a mono-channel .wav file; I am verifying that with the small check below. What is the maximum .wav file size?
Can we provide a custom scorer file while training the custom model?
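For context, this is the quick check I am running on each clip to confirm its channel count, sample rate, and size (a minimal sketch using only the Python standard library; the file path is a placeholder):

```python
import os
import wave

path = "clip.wav"  # placeholder: one of my training clips

with wave.open(path, "rb") as w:
    channels = w.getnchannels()          # expecting 1 (mono)
    rate = w.getframerate()              # sample rate in Hz
    width_bits = w.getsampwidth() * 8    # bit depth
    duration = w.getnframes() / float(rate)

size_mb = os.path.getsize(path) / (1024 * 1024)
print("channels=%d rate=%d Hz depth=%d bit duration=%.1f s size=%.2f MB"
      % (channels, rate, width_bits, duration, size_mb))
```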
Thank you for the guidance @lissyx. I will try that out.
Could you please also help me with the queries related to the inference part?
What are the specifications to keep in mind for the audio file used for inference? I have files in different formats (.mp3, .wav, etc.).
Is there any restriction on the length of the audio (in minutes)?
Thank you for the information.
I want to understand one thing. You mentioned that inference depends on the training spec, and in training we are only allowed to provide .wav files with a 16 kHz sample rate. Does that mean that for inference we should also convert the audio files to .wav format at 16 kHz before passing them in?
I am asking because, as I already mentioned, I have audio files in different formats.
Please help me understand this so I can move forward in the right direction.
lissyx:
We extract MFCCs, so I guess you can do that with any format. We only test WAV at 16 kHz and 8 kHz. The only constraint is that your inference data needs to match your training data.
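In that case, a practical approach seems to be converting everything to 16 kHz mono WAV before running inference. A minimal sketch of that, assuming ffmpeg is installed and the deepspeech==0.7.4 Python package; the input file name and the model/scorer paths are placeholders:

```python
import subprocess
import wave

import numpy as np
from deepspeech import Model

# Convert the source file (e.g. an .mp3) to 16 kHz, mono, 16-bit PCM WAV,
# matching what the released English models expect. Assumes ffmpeg is on PATH.
subprocess.run(
    ["ffmpeg", "-y", "-i", "input.mp3",          # placeholder input file
     "-ar", "16000", "-ac", "1", "converted.wav"],
    check=True,
)

# Load the 0.7.4 acoustic model and external scorer (placeholder paths).
ds = Model("deepspeech-0.7.4-models.pbmm")
ds.enableExternalScorer("deepspeech-0.7.4-models.scorer")

# Read the converted audio as 16-bit samples and run inference.
with wave.open("converted.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)

print(ds.stt(audio))
```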
This thread would have been more useful if it had explained how important it is to match the audio codec of the recordings. I am sure the closer the match, the better. For example, is it more important to match the speaker than the audio codec? It seems we need to dig deep into the documentation just to get an idea of what is important to match. Simply saying that the inference data needs to match the training data is not very informative; it does not help much in deciding the trade-off between the time spent on training and the resulting quality.