I recently started using DeepSpeech for audio transcription, and I have a few doubts to clarify.
Here are the specifications:
Training or Inference - Both
DeepSpeech branch/version - 0.7.4
OS Platform and Distribution (e.g., Linux Ubuntu 18.04) - Linux - Ubuntu 18.04
Python version - 3.6.9
TensorFlow version - tensorflow-gpu==1.15.2
Inference:
1. What are the specifications to keep in mind for the audio file used for inference? I ask because I have files in different formats (.mp3, .wav, etc.).
2. Is there any restriction on the length of the audio (in minutes)?
Training:
I have started training a custom model. I want to understand the basic specifications for the training data.
I understand it should be a mono-channel .wav file; I am verifying that with the small check below. What is the maximum .wav file size?
Can we provide a custom scorer file while training the custom model?
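For context, this is the quick check I am running on each clip to confirm its channel count, sample rate, and size (a minimal sketch using only the Python standard library; the file path is a placeholder):

```python
import os
import wave

path = "clip.wav"  # placeholder: one of my training clips

with wave.open(path, "rb") as w:
    channels = w.getnchannels()          # expecting 1 (mono)
    rate = w.getframerate()              # sample rate in Hz
    width_bits = w.getsampwidth() * 8    # bit depth
    duration = w.getnframes() / float(rate)

size_mb = os.path.getsize(path) / (1024 * 1024)
print("channels=%d rate=%d Hz depth=%d bit duration=%.1f s size=%.2f MB"
      % (channels, rate, width_bits, duration, size_mb))
```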
Thank you for the guidance @lissyx. I will try that out.
Could you please also help me with the queries related to the inference part?
What are the specifications to keep in mind for the audio file used for inference? I have files in different formats (.mp3, .wav, etc.).
Is there any restriction on the length of the audio (in minutes)?
Thank you for the information.
I want to understand one thing. You mentioned that inference depends on the training spec, and in training we are only allowed to provide .wav files with a 16 kHz sample rate. Does that mean that for inference we should also convert the audio files to .wav format at 16 kHz before passing them in?
I am asking because, as I already mentioned, I have audio files in different formats.
Please help me understand this so I can move forward in the right direction.
lissyx:
We extract MFCCs, so I guess you can do that with any format. We only test WAV at 16 kHz and 8 kHz. The only constraint is that your inference data needs to match your training data.
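In that case, a practical approach seems to be converting everything to 16 kHz mono WAV before running inference. A minimal sketch of that, assuming ffmpeg is installed and the deepspeech==0.7.4 Python package; the input file name and the model/scorer paths are placeholders:

```python
import subprocess
import wave

import numpy as np
from deepspeech import Model

# Convert the source file (e.g. an .mp3) to 16 kHz, mono, 16-bit PCM WAV,
# matching what the released English models expect. Assumes ffmpeg is on PATH.
subprocess.run(
    ["ffmpeg", "-y", "-i", "input.mp3",          # placeholder input file
     "-ar", "16000", "-ac", "1", "converted.wav"],
    check=True,
)

# Load the 0.7.4 acoustic model and external scorer (placeholder paths).
ds = Model("deepspeech-0.7.4-models.pbmm")
ds.enableExternalScorer("deepspeech-0.7.4-models.scorer")

# Read the converted audio as 16-bit samples and run inference.
with wave.open("converted.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), np.int16)

print(ds.stt(audio))
```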
This thread would have been more useful if it had explained how important it is to match the audio codec of the recordings. I am sure the closer the match, the better. For example, is it more important to match the speaker than the audio codec? It seems we need to dig deep into the documentation just to get an idea of what is important to match. Simply saying that the inference data needs to match the training data is not very informative; it does not help much in deciding the trade-off between the time spent on training and the resulting quality.