Spectrogram differences

I have some formatted audio data for training DeepSpeech. After formatting some new RAW data, I took a look at the spectrograms and the differences are concerning.
It seems an unknown layer of audio manipulation was applied while the training data was being formatted.

I pulled the same clip from the raw audio used for training, formatted it, and created the spectrograms. Here are the differences:

[Spectrogram of the formatted raw data (RAW)]

[Spectrogram of the training audio (TRAINING)]
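
In case anyone wants to reproduce the comparison, this is roughly how the two spectrograms can be plotted side by side. It's a minimal sketch assuming librosa and matplotlib (not necessarily the tool I actually used); the file names and FFT settings are placeholders.

```python
# Sketch: plot log-magnitude spectrograms of two versions of the same clip
# for a side-by-side visual comparison. File names are placeholders.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def plot_spectrogram(path, ax, title, sr=16000):
    # Load and resample to a common rate so the two plots are comparable.
    y, _ = librosa.load(path, sr=sr, mono=True)
    stft = librosa.stft(y, n_fft=512, hop_length=160)
    db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)
    img = librosa.display.specshow(db, sr=sr, hop_length=160,
                                   x_axis="time", y_axis="hz", ax=ax)
    ax.set_title(title)
    return img

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6), sharex=True)
plot_spectrogram("raw_formatted.wav", ax1, "RAW (my formatting)")
img = plot_spectrogram("training_clip.wav", ax2, "TRAINING (third-party formatting)")
fig.colorbar(img, ax=[ax1, ax2], format="%+2.0f dB")
plt.show()
```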

  • To my (human) ear, both of these clips sound the same.

I am wondering how/if these spectrogram differences will affect inference accuracy.

Thanks

I’m not sure I understand exactly what you’re doing and where the data for the spectrogram comes from?

RAW audio files were converted to wav, upsampled, and transcribed by a third party.

When I convert the same segment from the original file to wav and upsample it, the spectrogram doesn’t look the same. I figure the third party applied excessive, unknown, and unwanted audio manipulation.
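
For context, the plain version of that step looks something like this. This is only a sketch assuming the RAW files are headerless PCM and that soundfile/librosa are available; the sample rate, channel count, subtype, and file names are placeholders for whatever the radio capture actually uses.

```python
# Sketch of a plain convert + upsample, with no filtering or noise removal.
# All parameters and file names below are placeholders.
import librosa
import soundfile as sf

# Decode headerless raw PCM (adjust samplerate/channels/subtype to the source).
src, _ = sf.read("segment.raw", samplerate=8000, channels=1,
                 subtype="PCM_16", format="RAW")
wav = librosa.resample(src, orig_sr=8000, target_sr=16000)  # upsample only
sf.write("segment_16k.wav", wav, 16000, subtype="PCM_16")   # 16-bit mono WAV
```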

Any insight into how/if these spectrogram differences would affect inference?

Maybe I’m misunderstanding what has happened, but I struggle to see how you think someone other than you + your third party partner could answer this :confused:

I have a large set of data (radio comms) that was formatted to wav, upsampled, and annotated from long clips.

The annotators are claiming the clips were loaded into Audacity, upsampled, then transcribed. Nothing else.

I split the same segment from the original clip and upsampled it with Audacity, and the resulting spectrogram is different from the annotated versions.

No audible differences between the two clips, but it seems some sort of noise removal function was applied.

Given that the audio source is radio transmissions and all the clips share consistent frequencies… I fear the removal of all that ‘static noise’ (which appears in every sample) will affect the model’s ability to accurately infer on data that didn’t have this processing applied.

The problem is, I don’t have timestamps to re-split the clips, and I can’t exactly replicate a spectrogram to match the clips that were transcribed.
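
Since the exact processing can’t be reproduced, one rough way to at least quantify the gap is to compare estimated noise floors of the two versions of the same segment. A minimal sketch, assuming librosa; the file names are placeholders:

```python
# Rough check: estimate the noise floor of my re-split clip vs. the
# transcribed clip by taking a low percentile of per-frame RMS energy.
import librosa
import numpy as np

def noise_floor_db(path, sr=16000, percentile=10):
    y, _ = librosa.load(path, sr=sr, mono=True)
    # RMS energy per frame, in dB; a low percentile approximates the noise floor.
    rms = librosa.feature.rms(y=y, frame_length=512, hop_length=160)[0]
    return np.percentile(librosa.amplitude_to_db(rms, ref=1.0), percentile)

mine = noise_floor_db("my_resplit_clip.wav")       # placeholder path
theirs = noise_floor_db("transcribed_clip.wav")    # placeholder path
print(f"noise floor: mine {mine:.1f} dB, transcribed {theirs:.1f} dB, "
      f"gap {mine - theirs:.1f} dB")
```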

I think this will affect the model’s ability to run inference on live data. To what extent is the question.

A poor data handling process, lessons learned.

Thanks

Got it. Sorry to hear that.
It’s your call but I’d be tempted to give it a go anyway, although that isn’t based on any particular insight on my part!

Oh, so you got de-noised audio and you need to perform live inference on noisy audio?

I assume some sort of de-noise function was subtly applied that can’t be reproduced on any new clips.

To the human ear, the clips sound the same.
To the eyeballs, on a spectrogram… some data is definitely missing from these transcribed clips.

Data collection isn’t fully complete… so it’s kind of a blocker while we figure out what actions to perform on new data entering the collection. In my opinion those should be as basic as possible (format/convert and resample), the same actions we apply when inferring on live data.
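
For what it’s worth, keeping that minimal path identical for dataset prep and live inference might look something like this. It’s a sketch assuming librosa, soundfile, and numpy, with file names as placeholders; the DeepSpeech call is only indicative, since the Model constructor arguments differ between releases.

```python
# Sketch: one minimal preprocessing path (decode + resample to 16 kHz mono,
# 16-bit) shared by dataset prep and live inference. File names are placeholders.
import librosa
import numpy as np
import soundfile as sf

def prepare(path, target_sr=16000):
    audio, _ = librosa.load(path, sr=target_sr, mono=True)  # decode + resample only
    return (audio * 32767).astype(np.int16)                 # 16-bit PCM samples

# Dataset prep: write the clip out for annotation/training.
sf.write("new_clip_16k.wav", prepare("new_clip.wav"), 16000, subtype="PCM_16")

# Live inference: feed the exact same samples to DeepSpeech
# (constructor arguments depend on the release, so this part is indicative):
# from deepspeech import Model
# ds = Model("deepspeech.pbmm")
# print(ds.stt(prepare("live_capture.wav")))
```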

So hopefully this ‘de-noised’ audio data won’t have any negative effects on the model’s accuracy once it’s trained alongside newly transcribed ‘noisy’ clips. But if it does… I guess it always has a spot in the cloud.

Thanks for DeepSpeech!

So you will have both de-noised and noisy audio in your training set? That might actually not be a bad thing.