Data distribution among sets

Hi all,

I am a researcher working on better machine understanding of pathological speech. I have already been working with the German (/de/) Common Voice data and found that it is partitioned in a very specific way: among the provided partitions there is a train/dev/test split. From a data science perspective, I strongly appreciate this effort. What I wonder is: do these splits guarantee that no speaker appears in more than one of the three sets? Usually you want to make sure that all samples from a given speaker are assigned to a single split, so that you validate on “unseen” speakers.

Hey @PKlumpp,

"Each test/train/dev set is generated non-deterministically, meaning that they will vary from release to release even for minor updates. This is to avoid reproducing and perpetuating any demographic skews in each subsequent set. "

For more details on the metadata, please check out the GitHub repository: https://github.com/common-voice/cv-dataset

I would love to learn more about your project; if you would like to, please feel free to share it on this thread: Talk to us! How are you using Common Voice?


Hey there! Thanks for getting in contact with us. That is correct: a given speaker appears in only one of the sets, and a given sentence appears in only one of the sets.
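
For anyone who wants to double-check this on a downloaded release, here is a minimal sketch (assuming the usual train.tsv/dev.tsv/test.tsv files with the standard client_id and sentence columns):

```python
# Verify that no speaker (client_id) and no sentence text is shared
# between the train, dev and test splits of a single release.
import pandas as pd

splits = {name: pd.read_csv(f"{name}.tsv", sep="\t") for name in ("train", "dev", "test")}

for col in ("client_id", "sentence"):
    values = {name: set(df[col]) for name, df in splits.items()}
    assert not (values["train"] & values["dev"]), f"{col} shared between train and dev"
    assert not (values["train"] & values["test"]), f"{col} shared between train and test"
    assert not (values["dev"] & values["test"]), f"{col} shared between dev and test"

print("No speaker or sentence overlap between the splits.")
```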


Thanks for the quick response! What I do not understand is how you enforce this partitioning. If a speaker recorded their audio through several sessions, how do you link all these sessions to the respective speaker?

Hi @PKlumpp, each line has a client_id field:

  • client_id - hashed UUID of a given user
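
To make that concrete, a tiny sketch (assuming a release TSV such as validated.tsv with the standard client_id and path columns) of how all of a speaker’s sessions end up linked through that field:

```python
# Group all clips in a release TSV by client_id: every clip a registered
# user recorded, across all their sessions, carries the same hashed ID.
import pandas as pd

df = pd.read_csv("validated.tsv", sep="\t")
clips_per_speaker = df.groupby("client_id")["path"].apply(list)
print(clips_per_speaker.head())                      # a few speakers and their clip files
print(f"{df['client_id'].nunique()} distinct client_ids in this file")
```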

If the user has a profile, the client_id is static; otherwise it is generated per session. That is the best that can be done technically.


And, one user can also have multiple accounts :confused:


Hey @PKlumpp, apologies, I read your message incorrectly. Thanks @ftyers and @bozden for responding.


I’d suppose many top contributors would be registered. But if someone records 1000 sentences, these sets will only have a single recording, am I correct?

Are there any statistics showing how many recordings come from registered volunteers and how many from anonymous per-session users?


Each test/train/dev set is generated non-deterministically, meaning that they will vary from release to release even for minor updates. This is to avoid reproducing and perpetuating any demographic skews in each subsequent set.

The changing sets make them not very useful for comparing between releases, or for use with pretrained models such as the publicly available XLSR model (https://arxiv.org/pdf/2006.13979.pdf), which is trained on the November 2019 Common Voice release. This is why I do my own split that is consistent between releases, based on a hash of the user ID, as do the authors of the XLSR model, also excluding duplicate prompts. XLSR uses the same splits as Rivière et al., which are publicly available. I don’t see the connection with demographics, assuming you let each set grow as new speakers are added over time.
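
For illustration, a hash-based assignment along these lines can be very simple. This is only a sketch of the general idea, not anyone’s exact published recipe, and the 5%/5% thresholds are arbitrary:

```python
# Deterministically assign each speaker to train/dev/test from a hash of
# the client_id, so the assignment stays stable across releases.
import hashlib

def split_for(client_id: str, dev_pct: int = 5, test_pct: int = 5) -> str:
    digest = hashlib.sha256(client_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100        # stable pseudo-random bucket in 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"
```

New speakers simply fall into one of the buckets as they appear, and a speaker never moves between sets from one release to the next.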

As I’ve requested before, it would be great to have at least a “guest/not guest” flag for each user or session. Failing this, I agree that stats on guest vs. not would be useful. What would be better (for the data) is to encourage or require logging in.


I don’t see the connection with demographics, assuming you let each set grow as new speakers are added over time.

Imagine that in release 1.0 you have 20% female and 80% male voices, and you establish a test set. Then in release 2.0 you have 40% female and 60% male voices. If you maintain the previous test set, you still have the old distribution. If you let each set grow organically, insisting that all the previous recordings (20% female, 80% male) be included, then you will potentially have fewer new female speakers (either in the training set or in the test/dev sets). Does that make sense? Note that there is an additional issue on this subject regarding multiple recordings of the same sentence here.

I’d add that this is something that is being discussed and worked on at the moment, so watch this space :slight_smile:

As I’ve requested before, it would be great to have at least a “guest/not guest” flag for each user or session.

This definitely makes sense, but it would require substantial engineering effort. You can do much the same thing by just looking at the demographics columns: if they have been filled in, you know the contributor is logged in (with a profile); if they haven’t, the contributor is probably not logged in.
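
A rough sketch of that heuristic (assuming the release TSV’s age and gender columns; note it will miss registered users who left their profile empty):

```python
# Heuristic guest/registered proxy: rows where the contributor filled in
# age or gender very likely come from a logged-in profile.
import pandas as pd

df = pd.read_csv("validated.tsv", sep="\t")
has_profile_info = df[["age", "gender"]].notna().any(axis=1)
print(f"{has_profile_info.mean():.1%} of clips have some demographic info")
```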

What would be better (for the data) is to encourage or require logging in.

I think requiring people to log in is completely unreasonable.


@ftyers, what about registered people with no info on gender and/or age? How do you deal with these?

If you let each set grow organically, insisting that all the previous recordings (20% female, 80% male) be included, then you will potentially have fewer new female speakers (either in the training set or in the test/dev sets). Does that make sense?

No I don’t think that makes sense. If you use a consistent hash function to bin the speakers into sets, then statistically, all the sets will tend towards the same demographic distribution. As new speakers are added, the demographics of each set will change along with the dataset-wide demographics.
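
A toy simulation of that claim, on entirely synthetic data and with arbitrary bucket sizes, just to illustrate the statistics:

```python
# Hash speakers into fixed buckets and watch each bucket's gender ratio
# track the dataset-wide ratio as new speakers arrive with a different mix.
import hashlib
import random

def bucket(speaker_id: str) -> str:
    b = int(hashlib.sha256(speaker_id.encode()).hexdigest(), 16) % 100
    return "test" if b < 10 else "train"   # 10% test, 90% train (illustrative)

random.seed(0)
speakers = []
for release, female_share_of_new in [("1.0", 0.2), ("2.0", 0.4)]:
    start = len(speakers)
    speakers += [(f"spk-{start + i}", "F" if random.random() < female_share_of_new else "M")
                 for i in range(10_000)]
    for name in ("train", "test"):
        genders = [g for s, g in speakers if bucket(s) == name]
        print(f"release {release}: {name} female share = {genders.count('F') / len(genders):.2f}")
```

Both buckets end up near 0.20 after release 1.0 and near 0.30 after release 2.0, i.e. each set follows the cumulative dataset demographics instead of freezing the old ones.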

A more general theoretical problem with inconsistent sets is that, as this dataset becomes more important, different researchers will start tuning their models towards the training and development sets. This can eventually lead to a sort of community-wide overfitting. I assume this is part of the reason why Rivière and others use their own split of Common Voice, so that they will always have a test set uncontaminated by train and dev speakers.

How is that not already the case?

At the moment I believe they are treated neutrally in that respect, although I would suggest putting them in the training data and trying to keep the test data as balanced as possible.


heyhillary’s comment was that “Each test/train/dev set is generated non-deterministically, meaning that they will vary from release to release even for minor updates.” This says to me that speakers can switch sets between releases. I haven’t checked the data to verify this. But if the train/test/dev partitions are truly generated completely differently for each release, then the new test set could contain speakers from the previous release’s dev set, so you wouldn’t even want to re-use hyperparameters tuned on the previous version’s dev set, for example.

The speakers will change between the sets across releases, that is true, but within a single release you will never get the same speaker in more than one of train, dev, and test, which is the real issue. You will also never get the same sentence in train, dev, and test.

You are right about the hyperparameters, of course: you need to do a separate hyperparameter search for each release. But I don’t see that as a real problem. Given the size differences between releases, I don’t think the hyperparameters would be particularly stable anyway. E.g., if you go from 10h to 100h you are unlikely to want the same hyperparameters and will need to re-tune them anyway.

I agree a new hyperparameter search should probably be done for going from 10h to 100h, but I won’t be redoing many hyperparameter searches with my very limited resources at hand. But consider that model architecture choices can also lead to a type of overfitting, and this effect can be amplified as more researchers depend on the same dataset. In any case, at least two research groups are already sharing their own split, and anyone is free to share theirs or to make their own split.

Guest users are a bigger concern to me, but it’s hard to know exactly how big the concern is. In the French training set of release 7.0, for example, I count 79,057 out of 379,102 total sentences with both the gender and age fields empty. Any of that roughly 20% could have come back for more sessions and ended up in dev and/or test. I suppose I could exclude that 20% from my dataset if I’m worried about it.

I just found that the count of sentences with empty gender and age for the French dev set is 12,972 of 15,942, and for the test set it is 13,716 of 15,942, so I’m not sure what’s going on there and why they are so different from the training set. I wonder if dev and test really consist mostly of guest users?
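
The kind of count I mean, as a sketch (assuming the French release’s train.tsv/dev.tsv/test.tsv with the usual gender and age columns):

```python
# Count clips per split where both the gender and age fields are empty.
import pandas as pd

for name in ("train", "dev", "test"):
    df = pd.read_csv(f"{name}.tsv", sep="\t")
    missing = df["gender"].isna() & df["age"].isna()
    print(f"{name}: {missing.sum()} of {len(df)} clips have neither gender nor age")
```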

In general, yes, this is true, but it also makes sense to have a sensible canonical split per release. There are (at least) the following factors at play here which need to be balanced:

  • The privacy of contributors, who may not want to provide demographic information
  • The diversity and balance in the test set (e.g. we probably don’t want the test set to just reflect the demographic balance of the contributors)
  • The needs of researchers to have comprehensible and stable splits.

I definitely sympathise about the lack of compute power. I also think that the issue you raised, with 13,716 of 15,942 clips in test lacking demographic information, is problematic; we should definitely be trying to do better here. And I think that having a balanced test set is more important than having a balanced train or dev set.
