What is the purpose of train-all.csv?

chibt · May 31, 2021, 3:49pm

I am following this doc to train my own English model using CommonVoice data
https://deepspeech.readthedocs.io/en/r0.9/TRAINING.html

After running this command:

bin/import_cv2.py --filter_alphabet path/to/some/alphabet.txt /path/to/extracted/language/archive

the following files are generated:

clips/dev.csv
clips/test.csv
clips/train.csv
clips/train-all.csv

Then the next step is to train the model using clips/dev.csv, clips/test.csv and clips/train.csv.
Why don’t we use clips/train-all.csv as training data? This file has far more data than clips/train.csv and also comes from the validated dataset, so I think it should produce a better model. But the doc does not mention this file.
Also, was the DeepSpeech pre-trained model trained on clips/train.csv or clips/train-all.csv?

lissyx · May 31, 2021, 5:16pm

No, if you train with the validation dataset, you just overfit and learn nothing.

chibt · June 1, 2021, 3:06am

Hi, I do not train with the validation dataset.
By “validated dataset” I mean the file en/validated.tsv, whose clips have already been quality-checked through up votes and down votes. It is different from the validation set used during training.

Anyway, I just want to know if I should use en/clips/train-all.csv instead of en/clips/train.csv for training. I am sure that they do not include the dev and test datasets.
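
The claim that the training file shares no clips with dev or test can be checked directly. Below is a minimal sketch, assuming the imported CSVs carry a wav_filename column; the function names are hypothetical and not part of DeepSpeech.

```python
import csv


def clip_names(csv_path):
    """Return the set of wav_filename values found in one CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return {row["wav_filename"] for row in csv.DictReader(f)}


def shared_clips(train_csv, *held_out_csvs):
    """Return the clips that appear in train_csv and in any held-out CSV."""
    train = clip_names(train_csv)
    held_out = set()
    for path in held_out_csvs:
        held_out |= clip_names(path)
    return train & held_out


# Example call (paths are assumptions taken from the file list above):
# overlap = shared_clips("clips/train-all.csv", "clips/dev.csv", "clips/test.csv")
# print(len(overlap), "clips shared with dev/test")
```

An empty result only rules out shared audio files. It says nothing about whether the same sentence or the same speaker appears on both sides of the split.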

ftyers · June 1, 2021, 6:32pm

train-all.csv contains the training data without the one-recording-per-transcript restriction applied. See: https://github.com/mozilla/CorporaCreator/issues/113
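
One way to see the effect of that restriction is to count recordings per transcript in each file. This is a minimal sketch, assuming the imported CSVs carry a transcript column; the function names are hypothetical.

```python
import csv
from collections import Counter


def recordings_per_transcript(csv_path):
    """Count how many rows (recordings) share each transcript."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return Counter(row["transcript"] for row in csv.DictReader(f))


def repeated_transcripts(csv_path):
    """Return only the transcripts that have more than one recording."""
    counts = recordings_per_transcript(csv_path)
    return {text: n for text, n in counts.items() if n > 1}


# Example call (paths are assumptions taken from the file list above):
# for name in ("clips/train.csv", "clips/train-all.csv"):
#     print(name, len(repeated_transcripts(name)), "transcripts with repeats")
```

If the restriction works as described, train.csv should show few or no repeated transcripts, while train-all.csv may show many.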

Ugur_Turkdamar · February 14, 2023, 9:14am

Hey, my folder doesn’t have a clips/train-all.csv file. Any solution? I cannot use the import_cv2 script…