Danish sentences with old spelling (and even more with old capitalization)

I’ve been reviewing some Danish sentences, and a lot of the ones that come from Project Gutenberg contain words that are spelled differently now from how they were spelled when they were written, which means there are plenty of sentences I have to reject. There are two issues I’m noticing a lot. One is that words that are nowadays spelled with the letter å have two a’s instead (for example, “så” is written “saa”). The other is that all nouns are capitalized. I might have let a few sentences with old capitalization get through, and I hope that doesn’t cause too much trouble; I’m pretty sure a lot of speech recognition tools ignore capitalization anyway. If it’s only the first word in the text that’s capitalized, I let that through. But what should I do? Reject everything with old spelling or old capitalization? Reject old spelling but not old capitalization? Leave some sentences neither accepted nor rejected? I don’t remember exactly when the spelling changed, but the capitalization thing seems to affect a lot of sentences from that source, if not all of them. It would be nice if it were possible to fix the sentences so that they can be used.

It might be possible to get most of the double a’s fixed by find-replacing “aa” with “å”. As for the capitalization, some words (such as “en” and “et”) are much more likely to appear before common nouns (which aren’t capitalized anymore) than before proper nouns (which still start with a capital letter). Other words are usually followed by proper nouns: “Herr” seems to be pretty common in these sentences, but “Fru”, “Frøken”, “Hr.”, “Fr.” and “Frk.” might also appear.
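Here is a rough Python sketch of that idea, in case it helps whoever looks into this. It is only an illustration of the heuristics described above, not something the sentence collector does. The function names and the example sentences are made up, and the title list is just the words mentioned in this post. The “aa” replacement is deliberately naive, so it would also change names that still keep “Aa” (such as Aalborg), and the article rule only catches a noun that comes directly after “en” or “et”.

```python
# Titles that are usually followed by a proper noun (taken from the post above).
TITLES = {"Herr", "Fru", "Frøken", "Hr.", "Fr.", "Frk."}
# Indefinite articles, which usually precede a common noun.
ARTICLES = {"en", "et"}


def modernize_aa(text):
    """Naively replace old 'aa'/'Aa' with 'å'/'Å' (also hits names like 'Aalborg')."""
    return text.replace("Aa", "Å").replace("aa", "å")


def lowercase_after_articles(text):
    """Lowercase a capitalized word that directly follows 'en' or 'et'."""
    words = text.split(" ")
    result = []
    for i, word in enumerate(words):
        previous = words[i - 1] if i > 0 else ""
        if previous.lower() in ARTICLES and word[:1].isupper():
            word = word[:1].lower() + word[1:]
        result.append(word)
    return " ".join(result)


def flag_for_review(text):
    """List mid-sentence capitalized words that are neither titles nor preceded by one."""
    words = text.split()
    flagged = []
    for i in range(1, len(words)):
        word, previous = words[i], words[i - 1]
        starts_sentence = previous[-1] in ".!?"
        if (word[:1].isupper() and word not in TITLES
                and previous not in TITLES and not starts_sentence):
            flagged.append(word.strip(".,;:!?"))
    return flagged


def modernize(text):
    """Apply both automatic fixes; the rest still needs a human check."""
    return lowercase_after_articles(modernize_aa(text))
```

For a made-up sentence like “Saa kom en Mand gaaende med Herr Jensen.”, `modernize` gives “Så kom en mand gående med Herr Jensen.” and `flag_for_review` has nothing left to report. For “Han saa den gamle Mand.” the noun doesn’t directly follow an article, so it stays capitalized and `flag_for_review` returns it for a human to look at.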

Thanks for getting in touch! :slight_smile: Could you report this issue on the sentence collector issues page? https://github.com/Common-Voice/sentence-collector/issues

Thanks! I found that page before you replied, and there was already an issue about a lot of misspelled words in another language, Punjabi. I wrote a comment under it, mentioning that a lot of Danish sentences had a lot of misspelled words too. Is that enough, or should I add the Danish thing as a separate issue?

The other issue was https://github.com/common-voice/sentence-collector/issues/317 , and my comment on it is dated the 27th of January 2021, while all the other comments on that issue seem to be from the 16th of August 2020.

Yeah, it’s better to start a new issue. That issue is marked as “Waiting for feedback from the reporter”, so it probably won’t get checked until the reporter provides further information.

Thanks! I’ve done that now. The new issue is https://github.com/common-voice/sentence-collector/issues/411 (“Lots of Danish sentences have spelling and capitalization mistakes because they were written more than 80 years ago”).