For example, it is quite noticeable that a substantial share of the Swedish sentences comes from the opensubtitles.org subtitles for the Netflix movie Budapest.
A sentence such as
Vi organiserar svensexor i Budapest för franska brudgummar.
(We organize stag parties in Budapest for French grooms.)
is quite distinctive, and the string “Budapest” occurs 15 times in sentence-collector.json.
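For anyone who wants to repeat this kind of check, here is a minimal sketch. It assumes nothing about the internal structure of sentence-collector.json and simply counts substring matches in the raw file text; the helper names `count_occurrences` and `count_in_file` are made up for illustration.

```python
from pathlib import Path


def count_occurrences(text: str, needle: str) -> int:
    """Count non-overlapping, case-sensitive occurrences of needle in text."""
    return text.count(needle)


def count_in_file(path: str, needle: str) -> int:
    """Read a file as UTF-8 and count occurrences of needle in its raw text.

    The file is treated as plain text, so matches in any field (not only the
    sentence text) are counted. That is an assumption of this sketch.
    """
    return count_occurrences(Path(path).read_text(encoding="utf-8"), needle)
```

Calling `count_in_file("sentence-collector.json", "Budapest")` from the directory holding the file would give the raw count. Since the match is case-sensitive and not limited to whole words, it is only a rough indicator of how many sentences share a source.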
The recommendation of opensubtitles.org should at least come with a caveat: the user should make sure that the subtitle isn’t a derivative work of a copyrighted work (that is, one not in the public domain or under CC0).
(It should also be noted that homemade subtitles of this type often suffer from low quality. I found missing spaces, missing letters and misspellings among the Swedish sentences. I would guess there were many more before the review process, but with low-quality input you are almost bound to have a few slip through.)