My current understanding is that the train/dev/test sets are completely regenerated with each release, with no guarantee that the previous release's splits will be preserved, so I would caution against using the released splits as an academic source. See this thread: Dataset split best practices?
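If you want to check this for yourself, here is a minimal sketch of how one might measure how much of a split survives between two releases. It assumes each release ships a tab-separated `test.tsv` with a `path` column identifying each clip; the helper names (`read_clip_ids`, `split_overlap`) and the sample rows are made up for illustration and are not part of any official tooling.

```python
import csv
import io


def read_clip_ids(tsv_file, column="path"):
    """Return the set of values found in `column` of a tab-separated split file."""
    # QUOTE_NONE so that stray quote characters in sentences are not treated as CSV quoting.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[column] for row in reader}


def split_overlap(old_ids, new_ids):
    """Fraction of the old split's clips that also appear in the new split."""
    if not old_ids:
        return 0.0
    return len(old_ids & new_ids) / len(old_ids)


# Tiny in-memory stand-ins for test.tsv from two releases (made-up data).
old_test = io.StringIO(
    "client_id\tpath\tsentence\n"
    "A\tclip_1.mp3\thello\n"
    "B\tclip_2.mp3\tworld\n"
)
new_test = io.StringIO(
    "client_id\tpath\tsentence\n"
    "B\tclip_2.mp3\tworld\n"
    "C\tclip_3.mp3\tagain\n"
)

overlap = split_overlap(read_clip_ids(old_test), read_clip_ids(new_test))
print(f"{overlap:.0%} of the old test clips are still in the new test split")
```

With real data you would pass open file handles for the two releases' split files instead of the `io.StringIO` stand-ins. A low overlap would mean results reported on one release's test set are not directly comparable to results on another's.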