I am not sure. I compared the English files from the Dutch and German collections, and the beginnings of these files look identical, but they don't have the same size: the Dutch one is much bigger (297 MB vs 307 MB).
Edit: it looks like the biggest file is in the fr-en collection, but the English file there is just as big as the one in the en-nl collection.
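For anyone who wants to check where two of these English files start to diverge, here is a minimal sketch. It is not something I ran on the collections; the function name is made up for illustration and it assumes the files are UTF-8 text with one sentence per line.

```python
from itertools import zip_longest


def first_difference(path_a, path_b, encoding="utf-8"):
    """Return (line_number, line_a, line_b) for the first differing line.

    Returns None if both files are identical. If one file is shorter,
    its side of the tuple is None from the point where it ends.
    """
    with open(path_a, encoding=encoding) as file_a, open(path_b, encoding=encoding) as file_b:
        # zip_longest keeps going after the shorter file ends, padding with None
        pairs = zip_longest(file_a, file_b)
        for number, (line_a, line_b) in enumerate(pairs, start=1):
            if line_a != line_b:
                return number, line_a, line_b
    return None
```

Calling `first_difference("english-from-nl.txt", "english-from-de.txt")` (hypothetical file names) would show whether one file is simply a longer version of the other or whether the content changes partway through.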
After searching through the file for some typical topics, I think the percentage of problematic sentences is not very high. There are a lot of sentences with strong opinions about all kinds of political topics, but almost all of them use acceptable language. I am in favor of the QA process rather than the sentence collector.
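To put a rough number on that kind of topic search, something like the sketch below could help. It is only an illustration: the function name and the term list are hypothetical, and a keyword match only marks a sentence as worth a look, not as actually problematic.

```python
import re


def flagged_fraction(lines, terms):
    """Return the fraction of lines that contain at least one of the terms.

    Matching is case-insensitive and on whole words only.
    """
    if not terms:
        return 0.0
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    total = 0
    flagged = 0
    for line in lines:
        total += 1
        if pattern.search(line):
            flagged += 1
    # Avoid dividing by zero for an empty input
    return flagged / total if total else 0.0
```

Since an open file can be passed directly as `lines`, this works on a large file without loading it all into memory. The flagged sentences would still need to be read by a person, which is what the QA process would do anyway.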