I have some formatted audio data for training DeepSpeech. After formatting some new RAW data I took a look at the spectrograms and the differences are concerning.
It seems an unknown layer of audio manipulation was applied while the training data was being formatted.
I pulled the same clip from the raw audio used for training, formatted it, and created the spectrograms - here are the differences:
RAW audio files were converted to wav, upsampled, and transcribed by a third party.
When I convert the same segment from the original file to wav and upsample it, the spectrogram doesn't look the same - I figure the third party did excessive, unknown, and unwanted audio manipulation.
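For reference, this is roughly how the comparison can be reproduced (a minimal sketch; the file names and the 16 kHz target rate are my assumptions, not known details of the third party's pipeline):

```python
# Minimal sketch of the spectrogram comparison; file names and the 16 kHz
# target rate are assumptions, not details of the third party's pipeline.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

def log_spectrogram(path, sr=16000, n_fft=512, hop=160):
    """Decode/resample to the target rate and return a log-magnitude spectrogram."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    return librosa.amplitude_to_db(S, ref=np.max)

specs = {
    "third-party (transcribed) clip": log_spectrogram("clip_from_annotators.wav"),
    "self-processed clip": log_spectrogram("clip_resplit_by_me.wav"),
}

fig, axes = plt.subplots(len(specs), 1, figsize=(10, 6), sharex=True)
for ax, (title, S_db) in zip(axes, specs.items()):
    librosa.display.specshow(S_db, sr=16000, hop_length=160,
                             x_axis="time", y_axis="hz", ax=ax)
    ax.set_title(title)
plt.tight_layout()
plt.show()
```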
Any insight into how/if these spectrogram differences would affect inference?
I have a large set of data (radio comms) that was formatted to wav, upsampled, and annotated from long clips.
The annotators claim the clips were loaded into Audacity, upsampled, then transcribed. Nothing else.
I split the same segment from the original clip, upsampled it with Audacity, and the resulting spectrogram is different from the annotated version.
There are no audible differences between the two clips, but it seems some sort of noise-removal function was applied.
Given that the audio source is radio transmissions and all the clips share consistent frequencies… I fear the removal of all that 'static noise' (which appears in every sample) will hurt the model's ability to accurately run inference on data that didn't have this function applied.
The problem is, I don't have timestamps to re-split the clips, and I can't exactly replicate a spectrogram shape equal to that of the clips that were transcribed.
I think this will affect the model's ability to run inference on live data. To what extent is the question.
I assume some sort of de-noise function was subtly applied that can't be reproduced on any new clips.
To the human ear, the clips sound the same.
To the eyeballs on a spectrogram graph… some data is definitely missing from these transcribed clips.
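A rough way to check which frequency bands were attenuated (a sketch; it assumes both files cover the same segment at the same sample rate, and the file names are placeholders):

```python
# Rough band-energy comparison to see which frequency bands the third-party
# processing attenuated. Assumes both files cover the same segment; the file
# names are placeholders.
import numpy as np
import librosa

def mean_band_db(path, sr=16000, n_fft=512, hop=160):
    """Mean log-magnitude per frequency bin for the whole clip."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))
    return librosa.amplitude_to_db(S.mean(axis=1), ref=np.max)

freqs = librosa.fft_frequencies(sr=16000, n_fft=512)
annotated = mean_band_db("clip_from_annotators.wav")
mine = mean_band_db("clip_resplit_by_me.wav")

# Bands much quieter in the annotated clip point at where the 'static' went.
for f, d in zip(freqs, annotated - mine):
    if d < -6:  # arbitrary 6 dB threshold
        print(f"{f:7.1f} Hz: {d:+6.1f} dB")
```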
Data collection isn't fully complete… so this is kind of a blocker while contemplating what actions we should perform on new data entering the collection - which in my opinion should be as basic as possible (format/convert and resample, as in the sketch below) - the same actions applied when inferring on live data.
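Something like this is all I'd want to do to new clips (a sketch; the paths are placeholders, and the 16 kHz / 16-bit / mono WAV target assumes the input format of the released DeepSpeech models):

```python
# Sketch of a "convert and resample only" step; paths are placeholders and the
# 16 kHz / 16-bit / mono target assumes the released DeepSpeech models' format.
import librosa
import soundfile as sf

def prepare_clip(src_path, dst_path, target_sr=16000):
    """Decode, downmix to mono, resample, and write 16-bit PCM WAV - nothing else."""
    y, _ = librosa.load(src_path, sr=target_sr, mono=True)
    sf.write(dst_path, y, target_sr, subtype="PCM_16")

prepare_clip("raw/radio_clip_0001.flac", "train/radio_clip_0001.wav")
```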
So hopefully this 'de-noised' audio data won't have any negative effect on the model's accuracy once it's trained alongside newly transcribed 'noisy' clips. But if it does… I guess it always has a spot in the cloud.
Thanks for DeepSpeech!
lissyx
So you will have both de-noised and noisy clips in your training set? That might actually not be a bad thing.