Pitch_and_tempo augmentation

Hi Team,
This is regarding the newly added augmentation scheme.


I have a question about the pitch and tempo augmentation. I applied it to log-mel spectrograms rather than plain STFT spectrograms. Since the final output is always reshaped to [-1, c.n_input], how can the reshape work if my log-mel spectrogram has a different shape, i.e. its dimensions are not a multiple of 26?
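For anyone following along, the reshape constraint can be reproduced outside the training graph. This is only a sketch: it uses NumPy instead of TensorFlow, assumes c.n_input is 26 as in the question, and the 40-band log-mel shape is made up for illustration.

```python
import numpy as np

N_INPUT = 26  # assumed value of c.n_input, taken from the question


def flatten_features(features, n_input=N_INPUT):
    """Reshape [batch, time, n_features] to [-1, n_input]."""
    return features.reshape(-1, n_input)


# 26 features per frame: every frame becomes exactly one row.
ok = flatten_features(np.zeros((1, 100, 26)))
print(ok.shape)  # (100, 26)

# 40 log-mel bands per frame (hypothetical): 1 * 100 * 40 = 4000 values,
# and 4000 is not a multiple of 26, so the reshape is rejected.
try:
    flatten_features(np.zeros((1, 100, 40)))
except ValueError as err:
    print("reshape failed:", err)
```

Note that a feature count that happens to be a multiple of 26 (52, say) would reshape without an error but split each frame across two rows, so it is probably safer to make n_input match the number of bands than to rely on divisibility.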

Another question: when scaling pitch and tempo, the code seems to alter both the height and the width of the spectrogram. Pitch should ideally act on the frequency-bin dimension (width) of the spectrogram tensor, and tempo on the height dimension. Please let me know where I am going wrong, because the code seems to do just the opposite.

Hi,
The augmentation keeps the same height (frequency axis), but the width (time axis) can vary according to the scaling parameters.

Check the following example code

Regarding the spectrogram axes, I believe frequency on the y-axis and time on the x-axis is the default, both for librosa/TensorFlow spectrogram computations and for visualization.

In other words, the augmentation will keep the same shape[1] as the input, while shape[2] can change. Oh, and shape[0] is the size of the batch.

Let me know if you need anything

Thanks a lot, first of all, for the help.
Yes, for visualisation we generally keep frequency on the y-axis and the number of frames on the x-axis. But the spectrogram returned by contrib_audio.audio_spectrogram has these dimensions: the first is the channels in the input, so a stereo audio input would have two here; the second is time, with successive frequency slices; the third holds an amplitude value for each frequency during that time slice. So a tensor with shape [1, x, y] means the audio was mono, x is the number of time frames and y is the number of unique FFT bins (fft_length // 2 + 1). In that case, if the code changes shape[2], the result is no longer compatible with the final reshape to [-1, 26].
It might work when computing MFCCs from the augmented spectrogram, because the number of DCT coefficients (26) is set after the augmentation, so the output reshapes cleanly. But if someone wants to augment a log-mel spectrogram and reshape it to [-1, 26], the code throws the error: “input tensor shape not a multiple of 26.”
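To make the layout concrete, here is a small pure-Python sketch of the shape I would expect back from the spectrogram op. It rests on two assumptions that should be checked against your TensorFlow version: one frame per full window, and an FFT length rounded up to the next power of two of the window size. The helper names and the 16 kHz / 32 ms / 20 ms settings are only for illustration.

```python
def next_power_of_two(n):
    """Smallest power of two that is >= n."""
    power = 1
    while power < n:
        power *= 2
    return power


def spectrogram_shape(num_samples, window_size, stride, channels=1):
    """Expected [channels, time, frequency] shape of the spectrogram.

    Assumes one frame per full window and an FFT length equal to the
    next power of two of the window size.
    """
    if num_samples < window_size:
        num_frames = 0
    else:
        num_frames = 1 + (num_samples - window_size) // stride
    fft_length = next_power_of_two(window_size)
    return (channels, num_frames, fft_length // 2 + 1)


# One second of 16 kHz mono audio, 512-sample windows, 320-sample stride.
shape = spectrogram_shape(num_samples=16000, window_size=512, stride=320)
print(shape)  # (1, 49, 257)
```

With this layout, shape[1] is time and shape[2] is frequency, which is the point made above.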

Another query: pitch usually changes the frequency dimension, and tempo should act on the time dimension. Going by the shape of the spectrograms, pitch scaling should act on shape[2] and tempo scaling on shape[1]. The confusion might arise because librosa returns frequency along the first axis and time frames along the second. Kindly check and please update.
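As a rough illustration of the mapping I have in mind, the target sizes for a [batch, time, frequency] tensor could be computed as below. The function name is hypothetical and this is plain Python arithmetic, not the library code.

```python
def scaled_sizes(shape, pitch, tempo):
    """Return (new_time, new_freq) for a [batch, time, frequency] spectrogram.

    tempo > 1 shortens the time axis (shape[1]);
    pitch > 1 stretches the frequency axis (shape[2]).
    """
    _, time_steps, freq_bins = shape
    new_time = int(time_steps / tempo)
    new_freq = int(freq_bins * pitch)
    return new_time, new_freq


print(scaled_sizes((1, 49, 257), pitch=1.1, tempo=1.2))   # (40, 282)
print(scaled_sizes((1, 49, 257), pitch=0.95, tempo=1.0))  # (49, 244)
```

After the resize, the frequency axis still has to be brought back to its original size so that the final reshape keeps working.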

P.S.: Thanks a ton for walking me through this, and the time-frequency masking augmentation is already working wonders!! :slight_smile:

In this case you should change the crop_to_bounding_box and pad_to_bounding_box calls (specifically the target_height and target_width arguments). Cropping is used when the frequency axis is scaled to a larger size than the input’s, and padding is used when it is scaled to a smaller size.
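The crop-or-pad idea can be sketched with NumPy for a [batch, time, frequency] array. This is an illustration only: fit_frequency_axis is a hypothetical helper, not part of the augmentation code, and it pads with zeros the way pad_to_bounding_box does.

```python
import numpy as np


def fit_frequency_axis(spec, target_bins):
    """Crop or zero-pad the last axis of [batch, time, frequency]."""
    current_bins = spec.shape[2]
    if current_bins >= target_bins:
        # Scaled up: drop the bins above the original range.
        return spec[:, :, :target_bins]
    # Scaled down: fill the missing upper bins with zeros.
    pad = target_bins - current_bins
    return np.pad(spec, ((0, 0), (0, 0), (0, pad)), mode="constant")


wider = np.ones((1, 5, 30))
narrower = np.ones((1, 5, 20))
print(fit_frequency_axis(wider, 26).shape)     # (1, 5, 26)
print(fit_frequency_axis(narrower, 26).shape)  # (1, 5, 26)
```

For log-scaled features a zero fill is not the same as silence, so a floor value might be a better padding choice there; that is a judgment call to test on your data.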

Let me know if you got it.

Yes I did the same thing!
Here’s the code
```python
def augment_pitch_and_tempo(spec1,
                            max_tempo=1.2,
                            max_pitch=1.1,
                            min_pitch=0.95):
    original_shape = tf.shape(spec1)
    print("original_height=", original_shape[1])
    print("original_width=", original_shape[2])
    choosen_pitch = tf.random.uniform(shape=(), minval=min_pitch, maxval=max_pitch)
    print("choosen_pitch=", choosen_pitch)
    choosen_tempo = tf.random.uniform(shape=(), minval=1, maxval=max_tempo)
    print("choosen_tempo", choosen_tempo)
    new_width = tf.cast(tf.cast(original_shape[2], tf.float32) * choosen_pitch, tf.int32)
    print("new-width=", new_width)
    new_height = tf.cast(tf.cast(original_shape[1], tf.float32) / choosen_tempo, tf.int32)
    print("new_height=", new_height)
    spectrogram_aug = tf.image.resize_bilinear(tf.expand_dims(spec1, -1), [new_height, new_width])
    print("bilinear_augmented=", spectrogram_aug.shape)
    spectrogram_aug = tf.image.crop_to_bounding_box(spectrogram_aug, offset_height=0, offset_width=0,
                                                    target_height=tf.shape(spectrogram_aug)[1],
                                                    target_width=tf.minimum(original_shape[2], new_width))
    print("crop-to-bounding=", spectrogram_aug.shape)
    spectrogram_aug = tf.cond(choosen_pitch < 1,
                              lambda: tf.image.pad_to_bounding_box(spectrogram_aug, offset_height=0, offset_width=0,
                                                                   target_height=tf.shape(spectrogram_aug)[1],
                                                                   target_width=original_shape[2]),
                              lambda: spectrogram_aug)
    print("augmented=", spectrogram_aug.shape)
    return spectrogram_aug[:, :, :, 0]


def augment_speed_up(spec1,
                     speed_std=0.1):
    original_shape = tf.shape(spec1)
    choosen_speed = tf.math.abs(tf.random.normal(shape=(), stddev=speed_std))  # abs makes sure the augmentation will only speed up
    print("choosen_speed=", choosen_speed)
    choosen_speed = 1 + choosen_speed
    print("final_speed=", choosen_speed)
    new_height = tf.cast(tf.cast(original_shape[1], tf.float32) / choosen_speed, tf.int32)
    print("new_height=", new_height)
    new_width = tf.cast(tf.cast(original_shape[2], tf.float32), tf.int32)
    print("new_width=", new_width)
    spectrogram_aug = tf.image.resize_bilinear(tf.expand_dims(spec1, -1), [new_height, new_width])
    return spectrogram_aug[:, :, :, 0]
```

And it runs fine now :slight_smile:
Thanks a ton again!! :):smiley:

Hi @sumegha19 , could you please edit your message and quote the code with ``` code ``` so it becomes more understandable? Thanks!

```python
import tensorflow as tf  # TF 1.x API (tf.image.resize_bilinear), as used in this thread


def augment_pitch_and_tempo(spectrogram,
                            max_tempo=1.2,
                            max_pitch=1.1,
                            min_pitch=0.95):
    # spectrogram layout: [batch, time, frequency]
    original_shape = tf.shape(spectrogram)
    choosen_pitch = tf.random.uniform(shape=(), minval=min_pitch, maxval=max_pitch)
    choosen_tempo = tf.random.uniform(shape=(), minval=1, maxval=max_tempo)
    # Pitch scales the frequency axis (shape[2]); tempo scales the time axis (shape[1]).
    new_width = tf.cast(tf.cast(original_shape[2], tf.float32) * choosen_pitch, tf.int32)
    new_height = tf.cast(tf.cast(original_shape[1], tf.float32) / choosen_tempo, tf.int32)
    spectrogram_aug = tf.image.resize_bilinear(tf.expand_dims(spectrogram, -1), [new_height, new_width])
    # Crop the frequency axis back to its original size when it was scaled up.
    spectrogram_aug = tf.image.crop_to_bounding_box(spectrogram_aug, offset_height=0, offset_width=0,
                                                    target_height=tf.shape(spectrogram_aug)[1],
                                                    target_width=tf.minimum(original_shape[2], new_width))
    # Zero-pad the frequency axis back to its original size when it was scaled down.
    spectrogram_aug = tf.cond(choosen_pitch < 1,
                              lambda: tf.image.pad_to_bounding_box(spectrogram_aug, offset_height=0, offset_width=0,
                                                                   target_height=tf.shape(spectrogram_aug)[1],
                                                                   target_width=original_shape[2]),
                              lambda: spectrogram_aug)
    return spectrogram_aug[:, :, :, 0]


def augment_speed_up(spectrogram,
                     speed_std=0.1):
    original_shape = tf.shape(spectrogram)
    choosen_speed = tf.math.abs(tf.random.normal(shape=(), stddev=speed_std))  # abs makes sure the augmentation will only speed up
    choosen_speed = 1 + choosen_speed
    # Cast to float before dividing; dividing the int32 shape by a float32 tensor fails.
    new_height = tf.cast(tf.cast(original_shape[1], tf.float32) / choosen_speed, tf.int32)
    new_width = original_shape[2]  # the frequency axis is left unchanged
    spectrogram_aug = tf.image.resize_bilinear(tf.expand_dims(spectrogram, -1), [new_height, new_width])
    return spectrogram_aug[:, :, :, 0]
```

The indentation might not be exactly right, so please adjust it if needed. This code covers only the pitch-and-tempo and speed-up augmentations; all the other functions seem to run fine.

Please use ``` instead of ´´´´.