I’ve been reviewing some Danish sentences, and many of those that come from Project Gutenberg contain words that were spelled differently when the texts were written than they are now. That means there are plenty of sentences I have to reject. I keep noticing two issues. One is that words now spelled with the letter å are written with two a’s instead (for example, “så” is written “saa”). The other is that all nouns are capitalized. I might have let a few sentences with old capitalization through, and I hope that doesn’t cause too much trouble; I’m pretty sure a lot of speech recognition tools ignore capitalization anyway. If the only capitalized word is the first word of the text, I let the sentence through. But what should I do otherwise? Reject everything with old spelling or old capitalization? Reject old spelling but not old capitalization? Leave some sentences neither accepted nor rejected? I don’t remember exactly when the spelling changed, but the capitalization issue seems to affect a lot of the sentences from that source, if not all of them. It would be nice if the sentences could be fixed so that they can be used.
It might be possible to fix most of the double a’s by find-and-replacing “aa” with “å”. As for the capitalization, some words (such as “en” and “et”) are much more likely to appear before common nouns, which aren’t capitalized anymore, than before proper nouns, which still start with a capital letter. Other words are usually followed by proper nouns, so the capital after them should stay: “Herr” seems to be pretty common in these sentences, but “Fru”, “Frøken”, “Hr.”, “Fr.” and “Frk.” might also appear.
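To make the idea concrete, here is a rough Python sketch of both heuristics. It is only an illustration under some assumptions: the function names and the two word lists are made up for this example (the lists contain just the words mentioned above), each input string is assumed to be a single sentence, and the blind “aa” replacement will also hit names that still legitimately use “aa” (Aalborg, for instance). Any capitalized word that neither rule covers is returned for manual review instead of being changed.

```python
# Hypothetical word lists, taken only from the examples in this post.
ARTICLES = {"en", "et"}  # usually followed by a common noun
TITLES = {"Herr", "Fru", "Frøken", "Hr.", "Fr.", "Frk."}  # usually followed by a name


def fix_double_a(text):
    """Naively replace the old 'aa' spelling with 'å'.

    Caveat: this also rewrites names that still use 'aa' today.
    """
    return text.replace("Aa", "Å").replace("aa", "å")


def normalize(sentence):
    """Return (fixed_sentence, words_needing_review).

    Assumes one sentence per string, so only the first word is
    expected to be capitalized for sentence-initial reasons.
    """
    tokens = fix_double_a(sentence).split()
    fixed, review = [], []
    for i, tok in enumerate(tokens):
        if i > 0 and tok[:1].isupper():
            prev = tokens[i - 1]
            if prev.lower() in ARTICLES:
                # After "en"/"et": most likely a common noun, so lowercase it.
                tok = tok[0].lower() + tok[1:]
            elif prev not in TITLES:
                # Neither rule applies: leave it alone and flag it.
                review.append(tok)
        fixed.append(tok)
    return " ".join(fixed), review


if __name__ == "__main__":
    print(normalize("Herr Jensen saa en Hund i Haven."))
    # ('Herr Jensen så en hund i Haven.', ['Haven.'])
```

In that example “Hund” is lowercased because it follows “en”, “Jensen” is kept because it follows “Herr”, and “Haven.” is flagged because neither rule says anything about it. Sentences with an empty review list could probably be accepted after a quick look, while the rest would still need a human decision.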