Is there a way to work out exactly what to do to improve the dataset, or at least to interpret the graphs properly?
Could this even be modified so that it is given only a text file or a set of audio files, and it chooses the parts it needs to improve the model?
With such a high number of short audio clips, the model might have problems when inferring longer phrases. Nevertheless, you can try training a DDC/DCA model; you should be able to see whether it works well after approx. 20k steps.
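If you want to put a number on how skewed the dataset is towards short clips, something like the following rough sketch might help. It is not part of the analysis tool discussed here; it only uses Python's standard `wave` module and assumes plain PCM WAV files. The function names, the `wavs` directory, the 1-second bin width and the 2-second cutoff are all placeholders I made up for illustration.

```python
import wave
from collections import Counter
from pathlib import Path


def wav_duration(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def duration_histogram(durations, bin_width=1.0):
    """Count clips per duration bin.

    Key k covers the range [k * bin_width, (k + 1) * bin_width) seconds.
    """
    counts = Counter(int(d // bin_width) for d in durations)
    return dict(sorted(counts.items()))


def short_clip_share(durations, threshold=2.0):
    """Fraction of clips shorter than `threshold` seconds (arbitrary cutoff)."""
    if not durations:
        return 0.0
    return sum(d < threshold for d in durations) / len(durations)


if __name__ == "__main__":
    # "wavs" is a placeholder directory name; point it at your own dataset.
    durations = [wav_duration(p) for p in sorted(Path("wavs").glob("*.wav"))]
    print(duration_histogram(durations))
    print(f"share under 2 s: {short_clip_share(durations):.1%}")
```

If most of the clips land in the first one or two bins, that would match the concern above about longer phrases at inference time.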