As I understand it, the train/dev/test sets are completely regenerated with each release, with no guarantee that the previous release's splits are preserved, so I would caution against citing the released splits as a stable reference in academic work. See this thread: Dataset split best practices?
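
If you need a split that stays comparable across releases, one option is to derive it yourself from a stable identifier instead of relying on the shipped files. A minimal sketch, assuming each record carries an ID that does not change between releases (`record_id`, `assign_split`, and the percentages are hypothetical names and values of mine, not anything the dataset defines):

```python
import hashlib


def assign_split(record_id: str, dev_pct: int = 10, test_pct: int = 10) -> str:
    """Deterministically map a stable record ID to 'train', 'dev', or 'test'.

    The same ID always lands in the same split, regardless of which
    release it appears in or what other records are present.
    """
    # Hash the ID and reduce it to a bucket in the range 0..99.
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100

    if bucket < test_pct:
        return "test"
    if bucket < test_pct + dev_pct:
        return "dev"
    return "train"


# Hypothetical IDs, for illustration only.
for rid in ["rec_0001", "rec_0002", "rec_0003"]:
    print(rid, assign_split(rid))
```

Because the assignment depends only on the ID, records added in a later release get placed without moving existing ones. If the data has a grouping that must not leak across splits (for example, one speaker or author contributing many records), it is probably better to hash that group ID rather than the per-record ID.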