DeepSpeech training with large files

Hello, I am new to this forum and I am trying to build an English model that gives good results on long files (more than 10 minutes). I've been using the released Common Voice dataset and noticed that the model performs well enough on the test set (WER < 20%), but I would like to know whether I can train and test on longer files. Is there any limitation in the model's architecture that makes it unsuitable for this use case? Let me mention that I tested my model on a 10-minute BBC News audio file and the results were disappointing…
Looking forward to your response, thank you!

Training on long files, e.g. 10 minutes, isn't recommended. That said, decoding, i.e. transcribing, longer files is supported. Did you use the streaming API?
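For reference, the streaming flow looks roughly like this in Python. This is a minimal sketch, assuming the `deepspeech` pip package: the exact `Model` constructor arguments and the language-model loading calls vary between releases (older versions take the alphabet and beam width as constructor arguments, and in some versions `feedAudioContent`/`finishStream` are methods on the model that take the stream context), so check the docs for the version you have installed. The `chunks` helper is plain Python and independent of DeepSpeech.

```python
import wave


def chunks(data, size):
    """Split a bytes buffer into fixed-size pieces (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def transcribe_streaming(model_path, wav_path, chunk_bytes=8192):
    """Feed a long 16 kHz mono 16-bit WAV to DeepSpeech piece by piece.

    deepspeech and numpy are imported inside the function so the helper
    above stays stdlib-only; the method names follow recent releases of
    the deepspeech Python package and may differ in older ones.
    """
    import numpy as np
    import deepspeech  # third-party: pip install deepspeech

    model = deepspeech.Model(model_path)
    stream = model.createStream()
    with wave.open(wav_path, "rb") as wav:
        audio = wav.readframes(wav.getnframes())
    for piece in chunks(audio, chunk_bytes):
        # Each piece is raw 16-bit PCM; intermediate results could be
        # read here with stream.intermediateDecode() if desired.
        stream.feedAudioContent(np.frombuffer(piece, dtype=np.int16))
    return stream.finishStream()
```

The point of feeding chunks instead of one giant buffer is that memory use stays bounded and you can surface partial transcripts while a long recording is still being processed.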

First of all, thank you for the response! I am glad to hear that decoding/transcribing longer files is supported. Where can I find the streaming API, and is it related to transcribing long files? I transcribed a 10-minute audio file using this command:
./deepspeech --model /home/christina/PycharmProjects/DeepSpeech1/data_val/models/output_graph.pbmm --alphabet /home/christina/alphabet.txt --lm /home/christina/lm.binary --trie /home/christina/trie --audio /home/christina/Downloads/BBC-news.wav
and I got poor results, with many words stuck together. Is there any way to get better results?

I found this file in the native_client folder, but I can't imagine how I could use it to transcribe my long audio. Is there a command for it? Please forgive me, I am a beginner… Also, I found this example https://github.com/mozilla/DeepSpeech/tree/master/examples/ffmpeg_vad_streaming and I think it is related to my problem. But when I ran it, I got a blank inference. However, it works properly on a short WAV (e.g. a Common Voice sample).

Well, you need to write code to use the streaming API; the examples should be a fine starting point. I can't know why this specific example gives a blank inference, however; there's no obvious reason. Those examples come from contributors, and they may have regressed.

I gave you the entry points you need; we expose them in other languages as well, so look into the examples. I can't really help more unless you have precise questions.

Thank you for your support, I solved the blank-inference issue! It probably occurred because of the file type: I converted the audio to the expected format (16-bit, mono, 16 kHz) and now it works properly. I'll try to call the streaming API from the examples.
You've done a great job with the DeepSpeech project, keep going!
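For anyone hitting the same blank-inference problem: the model expects 16 kHz, mono, 16-bit PCM WAV input, and the stdlib `wave` module is enough to check a file before feeding it in. The function names below are my own for illustration; only the format values come from this thread.

```python
import wave

# DeepSpeech's acoustic model expects 16 kHz, mono, 16-bit PCM audio:
# (channels, sample width in bytes, frame rate)
EXPECTED = (1, 2, 16000)


def is_deepspeech_ready(channels, sampwidth, framerate):
    """Return True if the WAV parameters match the expected input format."""
    return (channels, sampwidth, framerate) == EXPECTED


def check_wav(path):
    """Open a WAV file and report whether it can be consumed as-is."""
    with wave.open(path, "rb") as wav:
        return is_deepspeech_ready(wav.getnchannels(),
                                   wav.getsampwidth(),
                                   wav.getframerate())
```

A file that fails this check (e.g. 44.1 kHz stereo) can be converted with ffmpeg or sox before inference, which is what fixed the blank output here.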