After training on LJSpeech, I am trying to feed the model another dataset, namely the French part of M-AILABS.
I used preprocess.py to generate a nested list containing all the wav files, which I then saved as a CSV file using csv.writer:
import csv
from datasets.preprocess import mailabs

# gather one metadata entry per wav file in the M-AILABS folder
items = mailabs("/content/gdrive/My Drive/Tacotron/fr_FR")

# write the entries out as metadata.csv
with open('/content/gdrive/My Drive/Tacotron/fr_FR/metadata.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(items)
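For anyone debugging something similar, it helps to sanity-check what the preprocessor actually returned before writing the CSV. This is just a sketch: the [text, wav_path, speaker_name] column order is what I believe this TTS version produces, so verify it against your datasets/preprocess.py.

print(len(items))   # should match the number of valid wav files
for row in items[:3]:
    print(row)      # I would expect something like [text, wav_path, speaker_name]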
I then proceeded to split it into train and eval sets with these shell commands:
shuf metadata.csv > metadata_shuf.csv
head -n 12000 metadata_shuf.csv > metadata_train.csv
tail -n 1100 metadata_shuf.csv > metadata_val.csv
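If you would rather stay in Python (e.g. in the same Colab notebook), here is a rough equivalent of the shell split above; the 12000/1100 sizes just mirror the head/tail commands and are not special:

import csv
import random

with open('/content/gdrive/My Drive/Tacotron/fr_FR/metadata.csv', newline='') as f:
    rows = list(csv.reader(f))

random.seed(0)  # make the shuffle reproducible
random.shuffle(rows)

# same sizes as the head/tail commands above
with open('/content/gdrive/My Drive/Tacotron/fr_FR/metadata_train.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows[:12000])
with open('/content/gdrive/My Drive/Tacotron/fr_FR/metadata_val.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows[-1100:])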
Then I changed the path and name of the dataset in config.json:
"datasets": [{"name": "mailabs", "path": "/content/gdrive/My Drive/Tacotron/fr_FR", "meta_file_train": null, "meta_file_val": null}]
But I keep getting the same error:
File "/usr/local/lib/python3.6/dist-packages/TTS-0.0.3+7e799d5-py3.6.egg/TTS/utils/generic_utils.py", line 78, in split_dataset
assert eval_split_size > 0, " [!] You do not have enough samples to train. You need at least 100 samples."
AssertionError: [!] You do not have enough samples to train. You need at least 100 samples.
However, I have 31143 rows in my CSV, all pointing to valid wav files.
I am not sure where to look next; my current guess is that it has something to do with the order of the elements in the CSV.
@julian.weber I think you successfully trained on the French part of the M-AILABS dataset; would you be willing to share your process?
EDIT: Never mind, I realised that the preprocessing is run inside the train function, so it worked once I changed the speaker regex in the mailabs preprocessor to:
speaker_regex = re.compile("(male|female)/(?P<speaker_name>[^/]+)/")
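For anyone hitting the same assertion: if the speaker regex does not match your folder layout, the preprocessor returns zero items, which is exactly what triggers the "not enough samples" error above. A quick way to check is to test the pattern against one of your wav paths. This is just a sketch; the path below reflects my Drive layout (and, if I remember correctly, the stock pattern also required a by_book/ prefix that my folders did not have), so substitute one of your own paths:

import re

speaker_regex = re.compile("(male|female)/(?P<speaker_name>[^/]+)/")

# example path from my layout -- replace with one of yours
wav_file = "/content/gdrive/My Drive/Tacotron/fr_FR/female/ezwa/monsieur_lecoq/wavs/sample_0001.wav"

match = speaker_regex.search(wav_file)
if match:
    print(match.group("speaker_name"))  # -> ezwa
else:
    print("no match -> 0 items -> the assertion above")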