CV German V.11 for training and evaluating an ASR model

Dear all,

I just started writing code for training an ASR model (wav2vec 2.0) on the Common Voice German dataset (with little programming knowledge). The goal is to evaluate the model's performance on non-native German speakers, since ASR models still make more errors on non-native speech than on native speech. The German dataset therefore needs to be split into German native and German non-native subsets based on the information in the "accents" metadata column (e.g. "Deutschland-Deutsch" vs. "Französisch-Deutsch").

Unfortunately, when preprocessing the training set for the subsequent training part of my code, I always get the error message "No space left on device". Even though I use Colab Pro (with access to a 40GB NVIDIA GPU and 89.6GB of RAM), the German dataset seems to be too large to be processed.

Second, I was wondering whether I could do something with the CorporaCreator (checking the data and filtering out down-voted files etc.), but when I run "create-corpora -d corpora -f clips.tsv" in PyCharm, the message "Permission denied: 'clips.tsv'" appears.

Does anybody have some ideas on how to solve these issues? Every hint would be very helpful!

Thank you very much in advance for your response.

Best,
Kristina

Hallo Kristina,

  1. Yes, you will run out of disk space with that amount of data on Colab Pro; you would perhaps need Colab Pro+. One possibility might be to keep the expanded dataset on Google Drive, but (i) it will be slower and (ii) with heavy reading, this time Google Drive itself will start blocking you (quota limits).

According to my calculations, somewhere beyond 100-120 hours of data Colab Pro becomes unusable; see the sketch below for one way to work around the disk limit.
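If you do try the Drive route, here is a minimal Colab sketch, assuming you load the data via the Hugging Face `datasets` library (the cache path and dataset id below are my assumptions, not anything from your setup). Streaming is an alternative that avoids writing the full dataset to disk at all:

```python
# Minimal sketch (assumptions: Hugging Face `datasets`, CV v7.0 on the Hub).
from google.colab import drive
drive.mount("/content/drive")

import os
# Keep the datasets cache on Drive instead of the small Colab disk.
# Must be set before importing `datasets`.
os.environ["HF_DATASETS_CACHE"] = "/content/drive/MyDrive/cv_cache"

from datasets import load_dataset

# Alternatively, streaming reads examples on the fly and sidesteps
# the "No space left on device" error entirely:
cv_de = load_dataset(
    "mozilla-foundation/common_voice_7_0", "de",
    split="train", streaming=True, use_auth_token=True,
)
```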

  2. CorporaCreator (CC): It is used manually by the Common Voice (CV) team to generate the default splits. First they dump the database into a clips.tsv file, which contains all recordings for the whole set of locales. CC then divides it by locale, and then into validated, invalidated, and other files. Next it works on validated.tsv to produce the train, dev, and test splits.

There is no clips.tsv file in the distribution, which is probably the cause of the error. It can be rebuilt by combining the validated, invalidated, and other files (see here for an implementation), but that is not really necessary; you only need validated.tsv to work on.
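If you ever do need a clips.tsv, a rough sketch of rebuilding it for one locale would look like this (paths and the locale column are assumptions based on the per-locale release layout; CC's real clips.tsv spans all locales):

```python
# Hypothetical rebuild of clips.tsv from one locale's release files.
import pandas as pd

parts = ["validated.tsv", "invalidated.tsv", "other.tsv"]
frames = [pd.read_csv(p, sep="\t", dtype=str) for p in parts]

clips = pd.concat(frames, ignore_index=True)
clips["locale"] = "de"  # assumption: CC expects a locale column
clips.to_csv("clips.tsv", sep="\t", index=False)
```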

AFAIK, you cannot use the more recent datasets, as their accents column is empty. In older datasets, e.g. v7.0, that info exists. Also, CC does not use that field for filtering; you would have to do it yourself - where the accent info is available at all.
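You can quickly check how much accent metadata a given release actually carries before committing to it. A sketch (note: depending on the release, the column may be named "accent" or "accents"):

```python
# Sketch: inspect accent coverage in validated.tsv.
import pandas as pd

val = pd.read_csv("validated.tsv", sep="\t", dtype=str)
print(val["accents"].notna().mean())           # fraction of rows with accent info
print(val["accents"].value_counts().head(20))  # most frequent accent labels
```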

For your project at hand, I assume you need to somehow tag the dataset (working on validated.tsv) and divide it into two or more files: val1.tsv, val2.tsv, etc.
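As a sketch of that tagging step (the native label set below is just an example; adjust it to the labels you actually see in the column):

```python
# Sketch: divide validated.tsv into native / non-native subsets.
import pandas as pd

val = pd.read_csv("validated.tsv", sep="\t", dtype=str)
val = val[val["accents"].notna()]              # drop rows without accent info

native_labels = {"Deutschland-Deutsch"}        # assumption: extend as needed
is_native = val["accents"].isin(native_labels)

val[is_native].to_csv("val1.tsv", sep="\t", index=False)   # native
val[~is_native].to_csv("val2.tsv", sep="\t", index=False)  # non-native
```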

Afterwards, you can use CC to divide each file into train/dev/test. But since the test sets will then be different, the results will not be scientifically comparable - I think.

Did you check the Artie Bias Corpus for a possible methodology?


Hello @bozden, thank you so much for your message! I will check it and read the Artie Bias Corpus article.

Best,
Kristina