Some words are missed and extra words are appended at the beginning and end during inference with v0.6.1

We have around 2300-2400 hrs of data in total, in Hinglish (80-85% Hindi and 15-20% English).

Our training data audio is split into chunks of 1.5 sec to 10 sec.

DeepSpeech training command:

./DeepSpeech.py \
  --train_files data/train.csv \
  --dev_files data/dev.csv \
  --test_files data/test.csv \
  --alphabet_config_path data/alphabet.txt \
  --checkpoint_dir ~/checkpoint_dir \
  --epochs 20 \
  --export_dir ~/export_dir \
  --lm_binary_path data/lm.binary \
  --lm_trie_path data/trie \
  --report_count 10 \
  --show_progressbar true \
  --train_batch_size 48 \
  --dev_batch_size 48 \
  --test_batch_size 48 \
  --learning_rate 0.0001 \
  --dropout_rate 0.15 \
  --n_hidden 2048 \
  --audio_sample_rate 8000

We run prediction on the whole audio call and on chunks of the same audio calls (split using VAD).

Problems that we face:

  1. VAD does not seem to remove noise properly.
  2. Some new words like ok, haan, sir etc. are added at the beginning and end of the predicted output for a chunk.
  3. Some words are missing in between the chunks.
  4. Words are missing when running inference on the whole audio call, but no extra words are appended in this case.

Examples:
1.
orig: kya meri baat ******(name) dutt(lname) ji se ho rahi hai kaise ho aap sir
predicted_with_lm: meri baat jo date ho rahi hai jaise
acoustic output: temse eri baas jobeste ho rahi hai kaise ape

2.
Orig: app open kroge vhi pr sir top pr naam show ho rha hoga and jahan pe naam show ho rha hoga vahan pr teen dots bane hoge theek hai sir sir unka jo number hai wo switch off aa raha hai

predicted_with_lm: aap open karoge to ek baar top pe aapka naam se ho raha hoga theek hai sir india mart ki aapne yahan pe naam se naam ka kar paneer steamer hai na usko raha hai ki inidible

acoustic output: aap open karaoge oek baar top pe aapka naam se oraha hoga theek hai sir indiamart ki aapme aanja pe aam s oga naam ka ka ee ar opan nme ki sir sir to baje number hai na usseco faraha hai ska sirnai in.
3.
orig: This call may be recorded traininig and quality purpose
predicted_with_lm: iis call we will recorded ten quality purpose
acoustic_output: this call we will recorded ten quanity polpose

4.
orig: and your service will be ended upto **** February 2022 ok and your service will be, and you will get twenty ****** per week and all ******* are lapsable and your service will be updated in ten to fifteen days sir

predicted_with_lm: and your service and will be for a two thousand twenty two ok sir your service will be you will get antimalarial you are the table and your service activation ten to fifteen days

acoustic_output: and your service banded wuill be ****** two thousand tenty two ok r your sirthis til be ou will get tente byi cer de aldyou sar let taple and your services ectivation ten to fifteen days

5.
orig: Third party team will visit for physical verification after taking appointment from yo ok sir
predicted_with_lm: third party team will be it or scale verification after the an content for you ok sir
acoustic_output: thir parti te wil betit or isical verification after dekain aont mont fom you ok sir

Any help appreciated.
Thanks


No need to be so pushy and file issues.

What are you talking about? There is no VAD in our codebase. And VAD stands for Voice Activity Detection; why would you expect that to de-noise?

Please define your problem properly here. You have two paths for inference; it’s unclear from your statement of the problems whether you have the issues in both cases …

It depends on what / how you chunk; if you remove vital information, the model might not be able to get the words.

Same

I don’t understand your sentence.

This is very hard to read; please ensure proper formatting and avoid mixing lists and plain text.

What about the LM? What is the source of those data?

How did you split your data into those sets? What is the training log, loss evolution and test set WER?

Let me revert with a full description of the process that I followed, with focus on just the missing words.

Data distribution:
Total data: 2300-2400 hrs
out of which 1700-1800 hrs are mono channel and 500-600 hrs are stereo channel.

Stereo calls are converted into mono using: sox infile.wav l_outfile.wav remix 1 and sox infile.wav r_outfile.wav remix 2
Each audio file is split into chunks of between 1.5 sec and 10 sec.
Audio is split using pydub AudioSegment:
audio = AudioSegment.from_wav("audio_call.wav")
audio_chunk = audio[int((start_time - 0.01) * 1000):int((end_time + 0.01) * 1000)]
where a 0.01 sec buffer is used on each side for a smoother audio cut.
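
Putting the above together, the per-call preprocessing is roughly like this (a simplified sketch; file names and the segment list are placeholders):

import subprocess
from pydub import AudioSegment

# Split one stereo call into left/right mono files with sox (as above).
subprocess.run(["sox", "infile.wav", "l_outfile.wav", "remix", "1"], check=True)
subprocess.run(["sox", "infile.wav", "r_outfile.wav", "remix", "2"], check=True)

# Cut one mono file into chunks; segments is a placeholder list of
# (start_time, end_time) pairs in seconds taken from the transcripts.
audio = AudioSegment.from_wav("l_outfile.wav")
segments = [(0.5, 4.2), (4.5, 9.8)]  # placeholder values
for i, (start_time, end_time) in enumerate(segments):
    # 0.01 sec buffer on each side for a smoother cut
    chunk = audio[int((start_time - 0.01) * 1000):int((end_time + 0.01) * 1000)]
    chunk.export(f"chunk_{i:04d}.wav", format="wav")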

The stereo data are a bit noisy and the words are spoken a bit fast, so we did not get good transcripts during prediction.
So we decided to peak-normalize the audio, and we got better transcripts than before.

Audio chunks were peak normalized using: ffmpeg-normalize input_wav_chunk.wav --normalization-type peak --target-level -14 --output norm_wav_chunk.wav

We normalized our whole dataset.
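
Normalizing the whole dataset was basically just a loop over the chunk files, something like this (sketch; the directory names are placeholders):

import subprocess
from pathlib import Path

# Peak-normalize every chunk to -14 dB, as with the single-file command above.
# "chunks" and "normalized" are placeholder directory names.
out_dir = Path("normalized")
out_dir.mkdir(exist_ok=True)
for wav in sorted(Path("chunks").glob("*.wav")):
    subprocess.run(["ffmpeg-normalize", str(wav),
                    "--normalization-type", "peak",
                    "--target-level", "-14",
                    "--output", str(out_dir / wav.name)], check=True)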

Training results:
WER: 0.29, train loss: 50.0, validation loss: 65.17.

Prediction:
We convert the stereo audio into mono, then normalize it, and then predict using the methods mentioned below.
We tried prediction in two ways:

  1. Split the audio into chunks based on VAD using webrtcvad and then run prediction on each chunk with the model (rough sketch below).
  2. Run prediction with the model on the whole audio.

In both cases, words or even whole sentences are missed.
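
For reference, the VAD chunking in option 1 is roughly like this (simplified sketch; the file name, aggressiveness and frame size are just illustrative):

import wave

import webrtcvad

vad = webrtcvad.Vad(2)               # aggressiveness 0-3
frame_ms = 30                        # webrtcvad accepts 10/20/30 ms frames

with wave.open("call_mono.wav", "rb") as wf:
    sample_rate = wf.getframerate()  # 8000 in our case
    pcm = wf.readframes(wf.getnframes())

frame_bytes = int(sample_rate * frame_ms / 1000) * 2   # 2 bytes per 16-bit sample
chunks, current = [], b""
for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
    frame = pcm[i:i + frame_bytes]
    if vad.is_speech(frame, sample_rate):
        current += frame             # still inside a speech segment
    elif current:
        chunks.append(current)       # a speech segment just ended
        current = b""
if current:
    chunks.append(current)
# each element of chunks is raw PCM for one speech segment that we then feed to the model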

That’s a high WER. We don’t have a loss evolution, so it’s not conclusive.

You only run prediction on your stereo sub-dataset? Nothing coming from the mono, less noisy part?

Our use-case data is stereo only; that’s why we predict on stereo data only.
Since we have less stereo data for training, we used mono channel data as well.
The mono channel audio is dual-channel, source-separated wav files.
The stereo data are not very noisy, but the words are spoken at a somewhat faster pace.

Sorry but you say you run inference on stereo? Without any conversion?

So your model struggles because you just don’t have enough data, I guess.

As mentioned earlier, we predict on the stereo calls by first splitting them into the mono left and right channels using sox remix. Then we also normalise them using ffmpeg-normalize.

We then run the prediction on the two mono calls that we get from each stereo call.
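
Putting it together, the per-call inference is roughly (simplified sketch; file and model paths are placeholders, assuming the 0.6 client flags --model/--lm/--trie/--audio):

import subprocess

call = "stereo_call.wav"
for channel, mono in ((1, "left.wav"), (2, "right.wav")):
    # 1. split the stereo call into one mono channel
    subprocess.run(["sox", call, mono, "remix", str(channel)], check=True)
    # 2. peak-normalize the mono file
    norm = "norm_" + mono
    subprocess.run(["ffmpeg-normalize", mono, "--normalization-type", "peak",
                    "--target-level", "-14", "--output", norm], check=True)
    # 3. run the DeepSpeech 0.6 client on the normalized mono file
    subprocess.run(["deepspeech", "--model", "output_graph.pbmm",
                    "--lm", "lm.binary", "--trie", "trie",
                    "--audio", norm], check=True)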

Right, your wording above was confusing. You perform a lot of audio changes, including normalization, that might have an impact / degrade the sound …

Does the pace at which the speakers are talking matter in the training?
If so, can we augment the data by introducing variations in tempo and speed?
Is this advisable?
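
For example, something along these lines with sox’s tempo effect (just a sketch; the file name and factors are arbitrary):

import subprocess

# Re-render a training chunk at a couple of tempo factors with sox.
for factor in (0.9, 1.1):
    subprocess.run(["sox", "chunk_0001.wav", f"chunk_0001_tempo_{factor}.wav",
                    "tempo", str(factor)], check=True)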

Yes, it has an impact

I think we have that in the code, but I have not yet had a chance to play with that.

We perform the same steps while training too. That should neutralise the effect a bit. Correct?

What other way do you suggest for training, if not giving the calls in chunks?
Or what size of chunks do you recommend?

Also, during prediction:
what is better… to predict on the whole length of the call,
or to predict on the chunks made from VAD?

That should, indeed, but that also means your model learns this bias.

1.5-10s seems fine. I’m more worried about how you cut, and about the normalization steps.

We have no problem on 3 min videos with a model trained on up to 10 sec audio, so the former should be fine. Adding VAD adds a variable to your equation when you want to remove as much as possible to isolate your issue.

What problem do you think this portion might be introducing?
Or can you suggest a way that you have tried that won’t have the problems you think this one will?

I don’t have your dataset, I can’t do your analysis.