Hey @PKlumpp,
"Each test/train/dev set is generated non-deterministically, meaning that they will vary from release to release even for minor updates. This is to avoid reproducing and perpetuating any demographic skews in each subsequent set."
For more details on the metadata, please check out the GitHub repository common-voice/cv-dataset (metadata and versioning details for the Common Voice dataset).
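If you need splits that stay stable across releases for your own experiments, one option is to derive them yourself instead of relying on the shipped test/train/dev files. Below is a minimal sketch, not an official Common Voice tool: it assumes each clip row carries a speaker identifier (I am assuming the `client_id` column here), and the function name `assign_split` and the 10/10/80 percentages are just placeholders for illustration.

```python
import hashlib


def assign_split(client_id: str, dev_pct: int = 10, test_pct: int = 10) -> str:
    """Map a speaker ID to 'train', 'dev', or 'test' deterministically.

    Hypothetical helper: the name and the default percentages are
    illustrative and not part of any Common Voice tooling.
    """
    # SHA-256 gives the same digest on every run and machine,
    # unlike Python's built-in hash(), which is salted per process.
    digest = hashlib.sha256(client_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # integer bucket in the range 0..99

    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"


if __name__ == "__main__":
    # Made-up speaker IDs, only to show the call.
    for speaker in ["speaker-a", "speaker-b", "speaker-c"]:
        print(speaker, assign_split(speaker))
```

Because the assignment depends only on the speaker ID, a given speaker always lands in the same split, no matter which release the clip comes from, and no speaker ends up in more than one split. Note that this does nothing to balance demographics, which is the reason the official splits are regenerated.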
I would love to learn more about your project. If you would like to, please feel free to share on this thread: Talk to us! How are you using Common Voice?