Danish sentences with old spelling (and even more with old capitalization)

I’ve been reviewing some Danish sentences, and a lot of those that come from Project Gutenberg contain words that are spelled differently now from how they were spelled when they were written, which means there are plenty of sentences I have to reject. There are two issues I’m noticing a lot. One is that words that are nowadays spelled with the letter å have two a’s instead (for example, “så” is written “saa”). The other is that all nouns are capitalized. (I might have let a few sentences with that get through; I hope that doesn’t cause too much trouble. I’m pretty sure a lot of speech recognition tools ignore capitalization anyway, and if it’s the first word in the text that’s capitalized, then I let that through.) But what should I do? Reject everything with old spelling or old capitalization? Reject old spelling but not old capitalization? Leave some sentences neither accepted nor rejected? I don’t remember exactly when the spelling changed, but the capitalization thing seems to happen to a lot of sentences from that source, if not all of them. It would be nice if it were possible to fix the sentences so that they can be used.

It might be possible to get most of the double a’s fixed by find-replacing aa with å. As for the capitalization, there are some words (such as “en” and “et”) that are much more likely to appear before common nouns (which aren’t capitalized anymore) than before proper nouns (which still start with a capital letter nowadays), and some words (“Herr” seems to be pretty common in these sentences, but “Fru”, “Frøken”, “Hr.”, “Fr.” and “Frk.” might also appear) that are usually followed by proper nouns, which are also capitalized nowadays.
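A rough sketch of what that find-and-replace plus the article/title heuristic could look like in Python. This is only an illustration under some assumptions: the word lists are tiny and hypothetical, each input is a single sentence, and the function names (`replace_double_a`, `modernize`) are made up for this example. Blind aa-to-å replacement will also hit names that legitimately keep “aa” (Aalborg, Kierkegaard), so the output would still need a human check.

```python
# Words that usually precede a common noun (illustrative, not exhaustive).
ARTICLES = {"en", "et"}
# Titles that usually precede a proper noun, which stays capitalized.
TITLES = {"herr", "fru", "frøken", "hr.", "fr.", "frk."}


def replace_double_a(text: str) -> str:
    """Blindly replace Aa/AA with Å and aa with å."""
    text = text.replace("Aa", "Å").replace("AA", "Å")
    return text.replace("aa", "å")


def modernize(sentence: str) -> tuple[str, list[str]]:
    """Return the rewritten sentence and the capitalized words left undecided."""
    tokens = replace_double_a(sentence).split(" ")
    unresolved = []
    for i in range(1, len(tokens)):  # token 0 is sentence-initial, leave it alone
        tok, prev = tokens[i], tokens[i - 1].lower()
        if not tok[:1].isupper():
            continue
        if prev in ARTICLES:
            # Probably a common noun: lowercase its first letter.
            tokens[i] = tok[:1].lower() + tok[1:]
        elif prev not in TITLES and tok.lower() not in TITLES:
            # Neither heuristic applies, so leave it for a human reviewer.
            unresolved.append(tok)
    return " ".join(tokens), unresolved


fixed, todo = modernize("Han saa en Mand paa Gaden.")
print(fixed)  # Han så en mand på Gaden.
print(todo)   # ['Gaden.']
```

The second return value matters here: a word like “Gaden” after “paa” is not covered by either heuristic, so it is reported instead of being guessed at.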

Thanks for getting in touch! :slight_smile: Could you report this issue on the sentence collector issues page? https://github.com/Common-Voice/sentence-collector/issues

Thanks! I found that page before you replied, and there was another issue about a lot of misspelled words in another language, Punjabi. I wrote a comment under it, mentioning that a lot of Danish sentences had a lot of misspelled words too. Is that enough, or should I add the Danish thing as a separate issue?

The other issue was https://github.com/common-voice/sentence-collector/issues/317 , and it seems my comment on it is from the 27th of January 2021, while all other comments on that issue seem to be from the 16th of August 2020.

Yeah, it’s better to start a new issue. That issue is marked as “Waiting for feedback from the reporter”, so it probably won’t get checked until there is further information about it.

Thanks! I’ve done that now. The new issue is https://github.com/common-voice/sentence-collector/issues/411 “Lots of Danish sentences have spelling and capitalization mistakes because they were written more than 80 years ago”.

I am rejecting most of these sentences from “Excentriske noveller”. I wonder how many there are. I suppose it would be better if they were all rejected at once, instead of us having to go through them all.

I managed to get through all the review sentences. I suppose there must be around 250 pages of “Excentriske noveller”. I hope that no one adds this kind of sentence again.

https://github.com/common-voice/sentence-collector/issues/411 has been closed and there has been no activity in this topic for a while, but when I’ve recorded sentences (or validated others’ recordings) I’ve still encountered a good number of sentences in “older young modern Danish” (i.e., pre‐1948 Danish [see footnote]) with double‐a’s in place of å’s, capitalised common nouns, and kunde/vilde/skulde in place of kunne/ville/skulle.

I’ve just “rolled with it” for some and reported others as incorrect grammar but I’ve been wondering how strict–lenient about it I should be. Should I just report everything that isn’t “younger young modern Danish”?
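One way to make the strict-or-lenient call more consistent might be to flag the three markers mechanically and only then decide by hand. The sketch below is just that, a sketch: `pre1948_markers` and `OLD_MODALS` are hypothetical names, and every flag is noisy. “kunde” is also the modern noun for “customer”, “vilde” is also a form of the adjective “vild”, proper nouns are capitalised mid-sentence in modern Danish too, and some names keep “aa”.

```python
import re

# Pre-1948 past-tense modal spellings and their modern forms.
OLD_MODALS = {"kunde": "kunne", "vilde": "ville", "skulde": "skulle"}


def pre1948_markers(sentence: str) -> list[str]:
    """Return the names of the pre-1948 markers found in a single sentence."""
    words = re.findall(r"\w+", sentence)  # \w matches å, æ, ø in Python 3
    markers = []
    if any("aa" in w.lower() for w in words):
        markers.append("double-a")
    if any(w.lower() in OLD_MODALS for w in words):
        markers.append("old-modal")
    # Skip the first word, which is capitalised in any orthography.
    if any(w[0].isupper() for w in words[1:]):
        markers.append("mid-sentence-capital")
    return markers


print(pre1948_markers("Han kunde ikke se Manden paa Gaden."))
# ['double-a', 'old-modal', 'mid-sentence-capital']
print(pre1948_markers("Han kunne ikke se manden på gaden."))
# []
```

A sentence that trips two or three markers is a much stronger candidate for reporting than one that trips only the capitalisation check.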

Also, the GitHub issue states that a number of sentences have been removed… were any of them reintroduced? Or is it just that a lot of sentences still got missed?

footnote on pre-1948 Danish

The 1948 Danish spelling reform is the most recent major spelling reform, and it is only around 75 years old. This means that very, very few texts written under the current “younger young modern Danish” paradigm are by authors who passed away 70 or more years ago, so very few of them can be used to source sentences. :frowning:

@jesslynnrose @Gina_Moape (sorry I don’t know Dmitrij’s handle on Discourse) - this is a great example of where BCP-47 could be applied. BCP-47 can be used to distinguish between orthographies - here we have the pre-1948 Danish spelling and the post-1948 Danish spelling used in the same text corpus. This would create inaccuracies in trained models, because kunne and kunde would be treated as different words when they are the same word.

I don’t speak Danish, but I would also hazard a guess that they are spoken using the same intonation / inflection - so the recorded audio would be the same, but the written transcription would be different.

If we used BCP-47 to tag sentences, we would end up with axes of variation on both orthography and accent / dialect such as:

  • Danish written before 1948 and spoken with xxx accent (say Jutlandic)
  • Danish written before 1948 and spoken with yyy accent (say East Danish)
  • Danish written after 1948 and spoken with xxx accent (say Jutlandic)
  • Danish written after 1948 and spoken with yyy accent (say East Danish)
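
As a minimal sketch of how those four combinations could be written as tags: the example below uses BCP-47 private-use subtags (everything after `x-`), because whether registered subtags exist for these orthographies and accents would need checking against the IANA language subtag registry. The subtag names (`pre1948`, `post1948`, `jutland`, `east`) and the helper `make_tag` are hypothetical, and the regular expression only checks the simple `language-x-...` shape, not the full BCP-47 grammar.

```python
import re
from itertools import product

# Hypothetical private-use subtags; each must be 1-8 alphanumeric characters.
ORTHOGRAPHIES = ["pre1948", "post1948"]
ACCENTS = ["jutland", "east"]

# Simplified shape check: a 2-3 letter language code, then "x", then subtags.
PRIVATE_USE = re.compile(r"^[a-z]{2,3}-x(-[a-z0-9]{1,8})+$")


def make_tag(orthography: str, accent: str, language: str = "da") -> str:
    """Build a private-use tag such as da-x-pre1948-jutland."""
    tag = f"{language}-x-{orthography}-{accent}"
    if not PRIVATE_USE.match(tag):
        raise ValueError(f"not a well-formed private-use tag: {tag}")
    return tag


tags = [make_tag(o, a) for o, a in product(ORTHOGRAPHIES, ACCENTS)]
print(tags)
# ['da-x-pre1948-jutland', 'da-x-pre1948-east',
#  'da-x-post1948-jutland', 'da-x-post1948-east']
```

With tags like these, a training pipeline could keep the two orthographies apart or normalise one into the other, independently of which accent the clip was recorded in.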

The significance of the orthography would vary with language - for example, many Indigenous languages that are traditionally oral languages have significant variation in orthography (see Sasha Wilmoth’s piece here on Indigenous Australian orthography), which makes machine learning more difficult, because the variance needs to be accounted for. Older orthographies would likely contain words that were more frequent in the period of the orthography (the “tenements and buggy carts” phenomenon), just as newer text contains terms like “large language model” or “generative AI” that didn’t exist 10 years ago.