Common Voice dataset importing problem

Md_Sakib_Ul_Rahman_Sourove · May 10, 2023, 5:40pm

bin/import_cv2.py --filter_alphabet data/alphabet.txt deepspeech-data/cv-corpus-12.0-delta-2022-12-07-en/en/
With this command the importer generates three CSV files: train-all.csv, other.csv, and validated.csv.
It does not generate the train.csv, dev.csv, and test.csv files. Is this a problem, or should I manually split train-all.csv into train, dev, and test files?
TIA

canseo · August 12, 2023, 6:28am

The DeepSpeech bin/import_cv2.py script is responsible for importing and preprocessing the Common Voice dataset. By default, it generates three CSV files: train-all.csv, other.csv, and validated.csv. These files contain different subsets of the dataset.

The train-all.csv file contains all the available training data, while other.csv and validated.csv contain data that can be used for validation or testing purposes. However, these files do not provide a predefined split for training, development, and testing.

If you want to follow a specific split for training, development, and testing, you will need to manually split the train-all.csv file into the desired subsets. Typically, this involves randomly dividing the data into three sets: training, development, and testing.
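The random split described above could be sketched roughly as follows. This is only an illustration, not part of DeepSpeech: the helper names `split_rows` and `split_csv`, the 80/10/10 proportions, and the fixed seed are all assumptions, and the code simply copies whatever header row train-all.csv already has.

```python
import csv
import random


def split_rows(rows, dev_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle rows and return (train, dev, test) lists.

    The fractions and seed are illustrative defaults, not DeepSpeech settings.
    """
    if dev_frac < 0 or test_frac < 0 or dev_frac + test_frac >= 1:
        raise ValueError("dev_frac and test_frac must be >= 0 and sum to < 1")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


def split_csv(src_path, train_path, dev_path, test_path, **kwargs):
    """Read src_path and write three CSVs that reuse its header row."""
    with open(src_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)
    subsets = split_rows(rows, **kwargs)
    for path, subset in zip((train_path, dev_path, test_path), subsets):
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(subset)
    # Return the row counts so the caller can sanity-check the split.
    return tuple(len(s) for s in subsets)
```

Calling `split_csv("train-all.csv", "train.csv", "dev.csv", "test.csv")` from the directory that holds train-all.csv would then write the three files. Note that a purely random split like this does not keep speakers separate, so the same voice can end up in more than one subset.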