What is the purpose of train-all.csv?

I am following this doc to train my own English model using CommonVoice data
https://deepspeech.readthedocs.io/en/r0.9/TRAINING.html

After running this command:

bin/import_cv2.py --filter_alphabet path/to/some/alphabet.txt /path/to/extracted/language/archive

the following files are generated:

  • clips/dev.csv
  • clips/test.csv
  • clips/train.csv
  • clips/train-all.csv

The next step is to train the model using clips/dev.csv, clips/test.csv and clips/train.csv.
Why don’t we use clips/train-all.csv as training data? This file has a lot more data than clips/train.csv and also comes from the validated dataset, so I think it should produce a better model. But I do not see any mention of this file in the doc.
Also, was the DeepSpeech pre-trained model trained on clips/train.csv or clips/train-all.csv?

No, if you train with the validation dataset, you just overfit and learn nothing.

Hi, I am not training with the validation dataset.
By “validated dataset” I mean the file en/validated.tsv, whose clips have already been quality-checked through up votes and down votes. It is a different thing and is not the set used for validation during training.

Anyway, I just want to know whether I should use en/clips/train-all.csv instead of en/clips/train.csv for training. I am sure that they do not include the dev and test datasets.
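
For anyone who wants to check that claim rather than take it on trust, here is a minimal sketch. It assumes the generated CSVs have a wav_filename column; the function names are made up for illustration and are not part of DeepSpeech.

```python
import csv


def load_filenames(csv_path):
    """Return the set of wav_filename values found in one CSV file.

    Assumption: the CSV has a header row with a 'wav_filename' column.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        return {row["wav_filename"] for row in csv.DictReader(f)}


def overlapping_clips(train_csv, *held_out_csvs):
    """Return clips that appear both in train_csv and in any held-out CSV."""
    train = load_filenames(train_csv)
    held_out = set()
    for path in held_out_csvs:
        held_out |= load_filenames(path)
    return train & held_out


# Example call (paths are placeholders):
# overlapping_clips("clips/train-all.csv", "clips/dev.csv", "clips/test.csv")
```

An empty result means no clip file is shared between the training CSV and the dev/test CSVs. It says nothing about shared sentences or speakers, which would need a separate check.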

train-all.csv contains the training data that doesn’t have the one recording per transcript restriction enabled. See: https://github.com/mozilla/CorporaCreator/issues/113
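
To see the effect of that restriction on your own import, a rough sketch like the following counts how many recordings share each transcript. It assumes the CSVs have a transcript column; the helper names are hypothetical.

```python
import csv
from collections import Counter


def transcript_counts(csv_path):
    """Count how many rows (recordings) exist for each transcript.

    Assumption: the CSV has a header row with a 'transcript' column.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        return Counter(row["transcript"] for row in csv.DictReader(f))


def summarize(csv_path):
    """Return row count, unique transcripts, and transcripts recorded more than once."""
    counts = transcript_counts(csv_path)
    return {
        "rows": sum(counts.values()),
        "unique_transcripts": len(counts),
        "repeated_transcripts": sum(1 for n in counts.values() if n > 1),
    }


# Example calls (paths are placeholders):
# summarize("clips/train.csv")
# summarize("clips/train-all.csv")
```

If the restriction works as described, train.csv should show few or no repeated transcripts, while train-all.csv should show many.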

Hey, my folder doesn’t have a clips/train-all.csv file. Is there any solution? I cannot use the import_cv2 script…