Merging Norwegian Nynorsk and Norwegian Bokmål

bozden · May 22, 2024, 9:39pm

Currently, the variant field in the .tsv files describes the spoken language (voice corpus), and it comes from the demographic info users set in their profiles (if any). Variants were introduced in 2022, and recently a new call has been made to add more. You can see the currently available ones all together here.
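If you want to see how the variant field is filled in for a locale, something like the following could help. It is only a minimal sketch: it assumes the .tsv has a header row with a column named `variant`, and the sample rows and the values `variant-a` / `variant-b` are made up for illustration, not real variant codes.

```python
import csv
import io
from collections import Counter


def count_variants(tsv_file):
    """Count rows per value of the `variant` column (assumed column name).

    Rows where the field is empty or missing are grouped under "(unset)",
    since the value only exists if the user set it in their profile.
    """
    # QUOTE_NONE: treat quote characters inside sentences as plain text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        value = (row.get("variant") or "").strip()
        counts[value or "(unset)"] += 1
    return counts


# Hypothetical sample data, not taken from a real release.
SAMPLE = (
    "path\tsentence\tvariant\n"
    "clip1.mp3\tFirst sentence\tvariant-a\n"
    "clip2.mp3\tSecond sentence\t\n"
    "clip3.mp3\tThird sentence\tvariant-a\n"
    "clip4.mp3\tFourth sentence\tvariant-b\n"
)

counts = count_variants(io.StringIO(SAMPLE))
for name, n in counts.most_common():
    print(name, n)
```

For a real file you would pass an open file handle instead of the `io.StringIO` sample, e.g. `open(path, encoding="utf-8", newline="")`, where `path` points to whichever .tsv you are inspecting.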

The new PRs I mentioned above are for the sentence variants. If you put many variants into a single dataset, people tend to vote NO on sentences written in a variant other than their own. Examples are the -ise/-ize spellings in English, or Anatolian Turkish vs Cypriot Turkish, where the spelling and meaning of words can differ.