Hi,
Just wondering: when using a multi-speaker dataset, does it matter if some speakers are not represented in the evaluation data?
For example, I have a dataset with 3 speakers, distributed as follows:
speaker 1: 10000 items
speaker 2: 1000 items
speaker 3: 1000 items
Suppose that, by chance, all the evaluation items are pulled from speaker 1. Does that matter?
Here is the code that does the splitting, from generic_utils.py:
def split_dataset(items):
    is_multi_speaker = False
    speakers = [item[-1] for item in items]
    is_multi_speaker = len(set(speakers)) > 1
    eval_split_size = 500 if len(items) * 0.01 > 500 else int(
        len(items) * 0.01)
    np.random.seed(0)
    np.random.shuffle(items)
    if is_multi_speaker:
        items_eval = []
        # most stupid code ever -- Fix it !
        while len(items_eval) < eval_split_size:
            speakers = [item[-1] for item in items]
            speaker_counter = Counter(speakers)
            item_idx = np.random.randint(0, len(items))
            if speaker_counter[items[item_idx][-1]] > 1:
                items_eval.append(items[item_idx])
                del items[item_idx]
        return items_eval, items
    else:
        return items[:eval_split_size], items[eval_split_size:]
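
As far as I can tell, the `speaker_counter[...] > 1` check in the loop only guarantees that each speaker keeps at least one item in the training set. It does not force every speaker into the evaluation set. If per-speaker coverage in the evaluation data turns out to matter, one option would be to sample from each speaker separately. Below is a minimal sketch of that idea, not the project's implementation. It assumes, as `split_dataset` does, that the speaker name is the last element of each item. The function name `stratified_split` and the `eval_fraction` and `seed` parameters are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(items, eval_fraction=0.01, seed=0):
    """Split items into (eval, train), sampling each speaker separately.

    Assumes item[-1] is the speaker name, as in split_dataset.
    """
    rng = random.Random(seed)

    # Group the items by speaker.
    by_speaker = defaultdict(list)
    for item in items:
        by_speaker[item[-1]].append(item)

    items_eval, items_train = [], []
    for speaker_items in by_speaker.values():
        shuffled = speaker_items[:]  # copy, so the caller's list is untouched
        rng.shuffle(shuffled)
        # Aim for at least one eval item per speaker...
        n_eval = max(1, int(len(shuffled) * eval_fraction))
        # ...but always leave at least one item for training, so a speaker
        # with a single item contributes nothing to the eval set.
        n_eval = min(n_eval, len(shuffled) - 1)
        items_eval.extend(shuffled[:n_eval])
        items_train.extend(shuffled[n_eval:])
    return items_eval, items_train
```

With the 10000 / 1000 / 1000 distribution above and the default `eval_fraction=0.01`, this puts 100, 10 and 10 items from speakers 1, 2 and 3 into the evaluation set. Unlike `split_dataset`, it does not cap the evaluation set at 500 items and it returns new lists instead of modifying `items` in place.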