I don’t see the connection with demographics, assuming you let each set grow as new speakers are added over time.
Imagine that in release 1.0 you have 20% female and 80% male voices. You establish a test set. Then in release 2.0 you have 40% female and 60% male voices. If you maintain the previous test set, you still have the old distribution. If you let each set grow organically, insisting that all the previous recordings (20% female and 80% male) be included, then you will potentially have fewer new female speakers (either in the training set or the test/dev sets). Does that make sense? Note that there is an additional issue on this subject regarding multiple recordings of the same sentence here.
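To put rough numbers on this, here is a minimal sketch. The speaker counts (a 100-speaker test set and 20 new slots) are made up for illustration; only the 20/80 and 40/60 percentages come from the example above.

```python
from fractions import Fraction


def female_share(old_female, old_male, new_female, new_male):
    """Female share of a test set that keeps every old speaker and adds new ones."""
    female = old_female + new_female
    total = old_female + old_male + new_female + new_male
    return Fraction(female, total)


# Hypothetical release 1.0 test set: 100 speakers, 20% female / 80% male.
OLD_FEMALE, OLD_MALE = 20, 80

# Frozen test set: no new speakers, so the old distribution is preserved.
frozen = female_share(OLD_FEMALE, OLD_MALE, 0, 0)

# Grown test set: keep the old 100 speakers and add 20 new ones, all female.
grown = female_share(OLD_FEMALE, OLD_MALE, 20, 0)

print(f"frozen: {float(frozen):.0%} female")
print(f"grown:  {float(grown):.0%} female")
```

With these assumed counts, even giving every new slot to a female speaker leaves the grown test set at about a third female, still below the 40% of release 2.0, and those 20 speakers are then unavailable for training.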
I’d add that this is something that is being discussed and worked on at the moment, so watch this space.
As I’ve requested before, it would be great to have at least a “guest/not guest” flag for each user or session.
This definitely makes sense, but would require substantial engineering effort. You can do the same thing by just looking at the demographics column. If they have filled it out, you know that they are logged in (with a profile); if they haven’t, they are probably not logged in.
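A minimal sketch of that heuristic, assuming a tab-separated metadata file with `client_id`, `age` and `gender` columns. The column names and the sample rows are assumptions for illustration, so check them against the actual release files.

```python
import csv
import io

# Hypothetical column names; check them against the actual release files.
DEMOGRAPHIC_COLUMNS = ("age", "gender")


def probably_logged_in(row):
    """Heuristic: any filled-in demographic field suggests a profile."""
    return any((row.get(col) or "").strip() for col in DEMOGRAPHIC_COLUMNS)


# Made-up sample rows in a tab-separated layout.
SAMPLE = (
    "client_id\tage\tgender\n"
    "a1\ttwenties\tfemale\n"
    "b2\t\t\n"
    "c3\t\tmale\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
flags = {row["client_id"]: probably_logged_in(row) for row in rows}
print(flags)
```

This is only a proxy: someone with a profile who left every demographic field blank would still be counted as a guest.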
What would be better (for the data) is to encourage or require logging in.
I think requiring people to log in is completely unreasonable.