No, I was talking about the things they say in their speeches. I am not sure how much of a problem this really is in this dataset, but I know that a few MEPs used words like “scum” for certain groups in their speeches.
Good point, I will do that. I will also filter out sentences containing letters that are not part of the German alphabet; this removes many words that are hard to pronounce.
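Roughly what I have in mind for the alphabet filter, as a minimal sketch. I am assuming here that "German alphabet" means a-z plus ä, ö, ü and ß in both cases; the function name `has_only_german_letters` is just made up for this example. Digits and punctuation are ignored by this check and would need their own rules.

```python
# Assumption: the German alphabet is a-z plus the umlauts and eszett, in both cases.
GERMAN_LETTERS = set(
    "abcdefghijklmnopqrstuvwxyzäöüß"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜẞ"
)


def has_only_german_letters(sentence: str) -> bool:
    """Return True if every alphabetic character is a German letter.

    Non-alphabetic characters (spaces, digits, punctuation) are not checked.
    """
    return all(ch in GERMAN_LETTERS for ch in sentence if ch.isalpha())


sentences = ["Das ist schön.", "Señor Barroso hat gesprochen.", "Wir fahren nach Łódź."]
kept = [s for s in sentences if has_only_german_letters(s)]
print(kept)
```

In this sample only the first sentence is kept, because the other two contain ñ, Ł, ó and ź.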
IMO the sentences are equal in quality to the sentences from Wikipedia, but they can’t be used without preprocessing. For example, in the German dataset many sentences start with a few letters indicating the original language (like EN: blabla). Things like this should be deleted first, but after that it looks fine.
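Removing the language markers could look something like the sketch below. It assumes the marker is always a two-letter uppercase code followed by a colon at the very start of the sentence, and the list of codes is my own guess rather than something taken from the dataset, so it should be checked against the real data. `strip_language_prefix` is a hypothetical helper name.

```python
import re

# Assumption: these are the language codes that appear as markers; adjust as needed.
LANGUAGE_CODES = {
    "BG", "CS", "DA", "DE", "EL", "EN", "ES", "ET", "FI", "FR", "HU",
    "IT", "LT", "LV", "NL", "PL", "PT", "RO", "SK", "SL", "SV",
}

# Matches e.g. "EN: " at the start of a sentence, but not other
# two-letter abbreviations such as "EU: ".
LANG_PREFIX = re.compile(r"^\s*(?:" + "|".join(sorted(LANGUAGE_CODES)) + r"):\s*")


def strip_language_prefix(sentence: str) -> str:
    """Remove a leading original-language marker such as 'EN: ' if present."""
    return LANG_PREFIX.sub("", sentence, count=1)


print(strip_language_prefix("EN: blabla"))
```

Using an explicit list of codes instead of "any two capital letters" avoids cutting off sentences that legitimately start with an abbreviation and a colon.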