How to split data on the basis of speakers?

agarwalaashish20 · December 21, 2019, 7:41pm

Hello Team,

Thank you for publishing a detailed paper on the Corpora creation.

In the paper, it is mentioned that the splits for Train, Test and Dev were done in such a way that one speaker’s recordings are only present in one data split.

“We made dataset splits (c.f. Table (2)) such that one speaker’s recordings are only present in one data split.”

Is there a way to apply the same kind of split to another open-source corpus? If so, could you please point me to the code?

Thank you.

phirework · December 23, 2019, 5:38pm

Hi @agarwalaashish20! The splits are calculated by this Python script, which is open source. Feel free to adapt or modify it to your needs: https://github.com/mozilla/CorporaCreator
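For another corpus, the general idea of a speaker-disjoint split can be sketched in a few lines of Python. This is only a minimal illustration and not CorporaCreator's actual logic: the function name `split_by_speaker`, the `speaker_id` field, and the 80/10/10 ratios are all assumptions made for the example. It groups clips by speaker, shuffles the speakers with a fixed seed, and then assigns each speaker (with all of their clips) to exactly one split.

```python
import random
from collections import defaultdict


def split_by_speaker(rows, speaker_key="speaker_id",
                     ratios=(0.8, 0.1, 0.1), seed=0):
    """Return train/dev/test lists in which no speaker appears twice.

    `rows` is a list of dicts, one per clip. `speaker_key` is a
    hypothetical field name; use whatever your corpus calls it.
    """
    # Collect every clip under the speaker who recorded it.
    by_speaker = defaultdict(list)
    for row in rows:
        by_speaker[row[speaker_key]].append(row)

    # Sort first so the shuffle is reproducible for a given seed.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    # The ratios are applied to the number of speakers, not clips.
    n_train = int(len(speakers) * ratios[0])
    n_dev = int(len(speakers) * ratios[1])
    groups = {
        "train": speakers[:n_train],
        "dev": speakers[n_train:n_train + n_dev],
        "test": speakers[n_train + n_dev:],  # whatever is left over
    }

    # Expand each group of speakers back into its clips.
    return {
        name: [row for spk in members for row in by_speaker[spk]]
        for name, members in groups.items()
    }
```

Because the ratios apply to speakers rather than clips, the splits can end up uneven in size when some speakers have recorded many more clips than others. If that matters for your corpus, you would need to balance by clip count as well.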