I am trying to run the DeepSpeech 0.7.0 inference engine on long audio files, 30-45 minutes or even an hour long. The audio was sampled at 44.1 kHz, and I see that client.convert_samplerate uses the SoX library to downsample it to 16 kHz. The recordings are standard American-accented speech with 2 speakers. The output I get is way off, very often gibberish, and nowhere near the accuracy I see on the LibriSpeech test-clean set.
I tried clipping this down to a 30-second segment and the transcription only got marginally better.
Is there a suggested approach to address this, ideally without retraining? Is retraining with longer audio clips even a solution here? I assume that would be extremely slow even if I have the hardware to support it (and my batch size would have to be small).
Also, is there a difference between running from the 0.7.0 checkpoint files and the 0.7.0 .pbmm model file?
Thanks
For what it's worth, you can use the transcribe.py utility to quickly transcribe longer audio files with GPU batching. It is unlikely that your results have anything to do with the length of the file; most likely the accent or recording characteristics are what the model handles poorly, or some other step is messed up, like the downsampling, somehow.
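For reference, an invocation looks roughly like this (the flag names are from memory, so treat them as an assumption and check python transcribe.py --helpfull for the exact ones):

# sketch only: transcribe a long recording using the 0.7.0 checkpoint (flag names are assumptions)
python transcribe.py \
  --src /path/to/long_recording.wav \
  --dst /path/to/long_recording.tlog \
  --checkpoint_dir /path/to/deepspeech-0.7.0-checkpoint \
  --scorer /path/to/deepspeech-0.7.0-models.scorer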
What do you mean “running from the checkpoint files”? You mean with --one_shot_infer? If so, the biggest difference is that --one_shot_infer does not do resampling at all.
I meant specifying the --checkpoint_dir param vs the --model param when using evaluate.py or transcribe.py, or running the DeepSpeech Python bindings with the 0.7.0 .pbmm model. Should I expect the same transcription results whether I load the 0.7.0 checkpoint or the 0.7.0 .pbmm model file?
I tried another test where I combined 10 of the longest files from LibriSpeech test-clean into a 5-minute wav (my merge tool ended up converting this to mp3 and upsampling it to 48 kHz). This longer, upsampled LibriSpeech test data produced the expected transcriptions.
To provide some background on my test files: they are meeting recordings with 2 or more people, at least 30 minutes long in most cases, so there are occasional situations where people talk over each other. Is this something that should already be handled because the model was trained on the Switchboard dataset? (Sorry, I don't have access to this dataset; since those are telephone conversations, I am assuming there is some amount of people talking over each other.)
I tried transcribe.py and it certainly is pretty fast compared to running the Python bindings, and the results are slightly better. Do you suggest downsampling with any library other than SoX? As far as the accent goes, it is standard American, but there is more voice modulation than in LibriSpeech since it is conversational. Do you have any suggestions on how this can be improved? Would you recommend fine tuning the model as documented here?
There is no tool that can take both types of input, so I’m not sure what you’re talking about. To answer the question in general, the results should be equivalent.
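To spell that out (paths and file names below are just placeholders): the checkpoint directory is consumed by the training-side tools, while the exported .pbmm goes to the client or the bindings, so you run one or the other:

# checkpoint files go with the training tools
python evaluate.py --test_files my_test.csv --checkpoint_dir /path/to/deepspeech-0.7.0-checkpoint
# the exported .pbmm goes with the client / Python bindings
deepspeech --model deepspeech-0.7.0-models.pbmm --scorer deepspeech-0.7.0-models.scorer --audio my_audio_16k.wav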
The assumption is incorrect: Switchboard has dedicated channels for each speaker, so there is no overlap.
Nice!
SoX should be fine, just follow the flags used by our client. Some flags can affect the results somewhat, like dithering.
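Something like this should be roughly equivalent to what the client does internally (a sketch only; check convert_samplerate in client.py for the exact flags it passes):

# downsample to 16 kHz, mono, 16-bit signed PCM, with dithering disabled (mirrors the client's flags)
sox input_44k.wav --bits 16 --channels 1 --rate 16000 --encoding signed-integer --no-dither output_16k.wav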
You can try fine tuning, just be mindful that it is a research task, not something you just hit a button and wait for it to be done. You’ll have to explore the parameters that work best for your data.
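To give an idea of the shape of such a run, here is a sketch along the lines of the fine-tuning example in the training docs; the hyperparameter values are placeholders you would have to tune for your data, not recommendations:

# continue training from the released 0.7.0 checkpoint on your own data (values are placeholders)
python DeepSpeech.py \
  --n_hidden 2048 \
  --checkpoint_dir /path/to/deepspeech-0.7.0-checkpoint \
  --train_files my_train.csv \
  --dev_files my_dev.csv \
  --test_files my_test.csv \
  --epochs 3 \
  --train_batch_size 8 \
  --learning_rate 0.0001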
An easier possibility might be making your own language model, especially if the recorded conversations are about a specific subject matter, with jargon or unusual words and sentences. It is a faster experiment than fine tuning. There is some documentation on how to do that here: https://github.com/mozilla/DeepSpeech/tree/master/data/lm
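The rough shape of it, going by the data/lm README (the values below are only illustrative; double-check the README for the parameters actually used for the release scorer):

# build an ARPA and binary KenLM model from your own text corpus (illustrative values)
python generate_lm.py \
  --input_txt my_corpus.txt.gz \
  --output_dir . \
  --top_k 500000 \
  --kenlm_bins /path/to/kenlm/build/bin/ \
  --arpa_order 5 \
  --max_arpa_memory "85%" \
  --arpa_prune "0|0|1" \
  --binary_a_bits 255 \
  --binary_q_bits 8 \
  --binary_type trie
# afterwards, package lm.binary plus the vocabulary file into a .scorer with the packaging
# script in the same directory (see the README for the exact command and the alpha/beta defaults)

Roughly speaking, --top_k controls how many distinct words are kept and --arpa_prune how aggressively rare n-grams are dropped; keeping more improves coverage at the cost of a larger model.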
Is there any limitation on the size of the language model? Would it make sense to create a txt file with the entire Wikipedia text and use that as input to the generate_lm.py script?
Is there a rule of thumb for how many sentences to add per word? For example, if I want the model to recognize company names like Verizon, Airbnb, etc., how many sentences should I add to help it transcribe these words while not disturbing what it already does right? Will adding more sentences increase the WER?
Thanks
The current language model includes Wikipedia as far as I know, but the more data in it the merrier.
Ideally you have many sentences for every word.
WER will only improve if you add the right words; only words that are in the language model can be found.
And the standard parameters reduce the generated model a bit; if you compress/prune less, you'll detect more, but you'll need more memory. Always a compromise …
Thanks for the reply, @othiele
Please correct me if I am wrong here; I was under the assumption that the LM binary was generated from the LibriSpeech normalized LM training text, available here. Is that not the case? I did not see any references to Wikipedia normalized text being used, though I did notice some posts from people trying to use Wikipedia text. To me the confusion is that a lot of the sentences in librispeech-lm-norm.txt don't make real sense, and yet the model seems to be working well on inference for the most part.