I have just started writing code to train an ASR model (wav2vec 2.0) on the Common Voice German dataset (with little programming experience). The goal is to evaluate the model’s performance on non-native German speakers, since ASR models still make more errors on non-native speech than on native speech. To that end, the German dataset should be split into a German native and a German non-native subset, based on the “accents” metadata column (e.g. “Deutschland-Deutsch” vs. “Französisch-Deutsch”).
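For reference, this is roughly the split I have in mind, as a minimal self-contained sketch. The sample rows are made up for illustration (real Common Voice TSV files have more columns), and I am assuming the accent labels look exactly like the ones above:

```python
import csv
import io

# Made-up sample mimicking rows of a Common Voice TSV file
# (only the columns relevant here).
tsv_data = (
    "client_id\tpath\tsentence\taccents\n"
    "a1\tclip1.mp3\tHallo Welt\tDeutschland-Deutsch\n"
    "a2\tclip2.mp3\tGuten Tag\tFranzösisch-Deutsch\n"
    "a3\tclip3.mp3\tWie geht es\tDeutschland-Deutsch\n"
)

rows = list(csv.DictReader(io.StringIO(tsv_data), delimiter="\t"))

# Split into native vs. non-native based on the accents label;
# rows with an empty accents field are left out of both subsets.
native = [r for r in rows if r["accents"] == "Deutschland-Deutsch"]
non_native = [r for r in rows
              if r["accents"] and r["accents"] != "Deutschland-Deutsch"]

print(len(native), len(non_native))  # → 2 1
```

On the real data the same filtering would run over the full TSV file instead of the inline string.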
Unfortunately, when preprocessing the training set for the subsequent training step, I always get the error “No space left on device”. Even though I use Colab Pro (with a 40 GB NVIDIA GPU and 89.6 GB of RAM), the German dataset seems to be too large to process; as far as I understand, the error refers to Colab’s disk filling up rather than to RAM or GPU memory.
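In case it helps with diagnosing this, here is a minimal sketch of the check I would run, assuming the preprocessed data goes through the Hugging Face datasets cache (the Drive mount path is an assumption and would need to be adjusted):

```python
import os
import shutil

# "No space left on device" is about disk, so check what is free
# where the datasets cache lives (default: ~/.cache/huggingface).
total, used, free = shutil.disk_usage(os.path.expanduser("~"))
print(f"free disk: {free / 1e9:.1f} GB")

# Assumption: redirecting the cache to a larger mount (e.g. a mounted
# Google Drive in Colab) before calling load_dataset() / map() keeps
# the preprocessed Arrow files off the small default disk.
os.environ["HF_DATASETS_CACHE"] = "/content/drive/MyDrive/hf_cache"
```

The environment variable has to be set before the datasets library writes anything, otherwise the default cache location is used.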
Second, I was wondering whether I could do something with the CorporaCreator (checking the data and filtering out down-voted files, etc.), but when I run “create-corpora -d corpora -f clips.tsv” in PyCharm, the message “Permission denied: ‘clips.tsv’” appears.
Does anybody have any ideas on how to solve these issues? Every hint would be very helpful!
Thank you very much in advance for your response.