Merging Norwegian Nynorsk and Norwegian Bokmål

There is only one Norwegian spoken language…

I don’t understand why the decision to split Norwegian into different written forms was made.

If the purpose of the Common Voice project is

“building an open-source, multi-language dataset of voices that anyone can use to train speech-enabled applications,”

then this only makes sense if one wants to make speech-enabled applications that work for either 10% or 90% of the Norwegian population, depending on which written language each individual contributor happens to prefer.

The decision to split Norwegian by written form creates a weird, arbitrary split, because it is up to the speaker whether they use Norwegian Nynorsk or Norwegian Bokmål as their written form. Two people from the same small village, with the same dialect, could prefer different written forms.

It’s a choice of the speaker.
Sure, there is a strong correlation between the chosen written language and dialect of the speaker, but it is just that, a correlation and not a causation. And regardless, it is the same spoken language.

This just creates two sub-optimal datasets for the Norwegian spoken language based on the choice of written language of the contributors.

Why? Who needs that? How is that relevant for creating speech-enabled applications for Norwegian speakers?
What was the reason for splitting the language into two datasets in the first place?
Is there something that I am not understanding here? I feel like my brain is melting :rofl:

I will argue that for handling the Norwegian language in the Mozilla Common Voice project it only makes sense to have a single dataset, as it is a single SPOKEN language.
That is, one dataset that includes all Norwegian speakers, regardless of written form, where each sentence has a Nynorsk and a Bokmål version.

{
  "sentence_id_key": {
    "norwegian_nynorsk": "Ein blir sterkt gripen av personane og livet deira.",
    "norwegian_bokmål": "En blir sterkt grepet av personene og livet deres."
  }
}

When creating new sentences for the merged corpus, users should choose which written form they want to add the sentence in.

Recorders should choose their preferred written form, Norwegian Nynorsk or Bokmål, and be shown the corresponding form. Validators should see both versions.
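
To make the recorder/validator rule concrete, here is a minimal Python sketch. It assumes the sentence-pair shape from the JSON example above; the function name `texts_to_show` and the role strings "recorder" and "validator" are hypothetical and not part of any existing Common Voice code.

```python
from typing import Dict, List, Optional

# Keys taken from the sentence-pair example in this post.
FORMS = ("norwegian_nynorsk", "norwegian_bokmål")


def texts_to_show(
    pair: Dict[str, str],
    role: str,
    preferred_form: Optional[str] = None,
) -> List[str]:
    """Return the sentence text(s) a user should see for one sentence pair.

    Hypothetical helper: recorders see only their preferred written form,
    validators see every form that exists for the sentence.
    """
    if role == "validator":
        return [pair[form] for form in FORMS if form in pair]
    if role == "recorder":
        if preferred_form not in FORMS:
            raise ValueError(f"unknown written form: {preferred_form!r}")
        return [pair[preferred_form]]
    raise ValueError(f"unknown role: {role!r}")


pair = {
    "norwegian_nynorsk": "Ein blir sterkt gripen av personane og livet deira.",
    "norwegian_bokmål": "En blir sterkt grepet av personene og livet deres.",
}
recorder_view = texts_to_show(pair, "recorder", "norwegian_nynorsk")
validator_view = texts_to_show(pair, "validator")
```

The recording itself stays tied to the sentence id, so the same audio is usable no matter which written form the speaker was shown.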

This approach would create one dataset that covers all Norwegian speakers regardless of dialect while maintaining the relation of written form to audio recording.

This could be done without losing the progress already made for Norwegian Nynorsk and Norwegian Bokmål datasets by merging them, making each sentence into a sentence pair.
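
A rough sketch of what such a merge could look like, in Python. This rests on assumptions of mine: the existing Nynorsk and Bokmål sentences are not translations of each other, so each one becomes a pair whose other side stays empty (`None`) until someone supplies it. The id prefixes `nn-` and `nb-` and the function name `merge_corpora` are made up for illustration.

```python
from typing import Dict, Iterable, Optional

SentencePair = Dict[str, Optional[str]]


def merge_corpora(
    nynorsk_sentences: Iterable[str],
    bokmaal_sentences: Iterable[str],
) -> Dict[str, SentencePair]:
    """Turn two single-form sentence lists into one corpus of sentence pairs.

    Every existing sentence is kept. The missing written form is left as
    None so that it can be filled in later by a contributor.
    """
    merged: Dict[str, SentencePair] = {}
    for index, sentence in enumerate(nynorsk_sentences):
        merged[f"nn-{index}"] = {
            "norwegian_nynorsk": sentence,
            "norwegian_bokmål": None,
        }
    for index, sentence in enumerate(bokmaal_sentences):
        merged[f"nb-{index}"] = {
            "norwegian_nynorsk": None,
            "norwegian_bokmål": sentence,
        }
    return merged


merged = merge_corpora(
    ["Ein blir sterkt gripen av personane og livet deira."],
    ["En blir sterkt grepet av personene og livet deres."],
)
# Pairs that still need their counterpart written.
incomplete = [key for key, pair in merged.items() if None in pair.values()]
```

Existing recordings would keep pointing at the form the speaker actually read, so nothing already collected is lost.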

I am also a software developer and would be interested in helping if such a merge becomes relevant.

Much love
Eskil


I am not a CV staff member, just a community member. However, I believe this is due to the current CV database design, which does not distinguish between spoken language and orthography. Several spoken languages have multiple written forms. For example, compare the spelling differences between American English and Australian English:

“my neighbor harbors views which systematize racism”
“my neighbour harbours views which systematise racism”

Azeri, for example (from memory), can be written in Latin script (like English), in Arabic script, and I believe also in Cyrillic: three orthographies for one spoken language.

My understanding is that the Common Voice roadmap is addressing the relationship between spoken language and orthographies in the future, but @Gina_Moape or @jesslynnrose may wish to comment further.

I’m not sure if it is the same thing, but “sentence variants” are under development. See the latest merged (but not-yet-released) PRs, like #4476 through #4482 (see GitHub). I think these will handle -ise / -ize styles, but I don’t know how much more they are meant to cover (such as completely different writing systems).

I’m sure @ftyers will have an answer to that :slight_smile:

Thanks for the response.

This is true, and a dataset each for American, Aussie and British English doesn’t really make that much sense :sweat_smile:

So the likely reason for the Norwegian language ending up with two datasets was the limitations of the CV database design? But I still don’t understand why two were made instead of one.

This splits a speaker base of about 5 million native speakers by region and by user choice, which seems odd for a Common Voice dataset.

I might sound a bit critical here and will try to tread carefully and give some context.

About 10% of the Norwegian population uses Norwegian Nynorsk as their written language, while 90% use Norwegian Bokmål.

Despite being a minority, Nynorsk completed its first corpus in January 2022, while Bokmål is still collecting sentences and progressing slowly :sleeping:.

One might expect the opposite given the user ratio.

I’ve noticed a similar ratio with Catalan and Spanish, though in their case they are two different languages.

Catalan, with around 9 million speakers in Spain, has collected 72.52 GB of audio recordings. In contrast, Spanish, with 45-47 million speakers in Spain and approximately 493 million native speakers worldwide, has only 46.97 GB. The higher engagement from a minority is, in my opinion, very evident and only natural, as language and identity go hand in hand.

In Norway we see the same engagement from Norwegian Nynorsk users. The issue here is that it’s only one language, not two.

The division might not align with the Common Voice project’s goal of creating a beneficial resource for all Norwegian speakers.

If you want to train models on the Norwegian language, you will need all the dialects and as many users as possible. So the two datasets will likely be manually merged regardless.
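
As an illustration of that manual merge, here is a small Python sketch that reads clip metadata from both datasets and tags each row with its written form. I am assuming tab-separated files with `path` and `sentence` columns; the file contents, clip names, and the `written_form` column are invented for the example.

```python
import csv
import io
from typing import Dict, Iterator, List


def read_clips(tsv_text: str, written_form: str) -> Iterator[Dict[str, str]]:
    """Yield one row per clip, tagged with the written form of its dataset."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        yield {
            "path": row["path"],
            "sentence": row["sentence"],
            "written_form": written_form,  # hypothetical extra column
        }


def combine(nynorsk_tsv: str, bokmaal_tsv: str) -> List[Dict[str, str]]:
    """Concatenate both datasets into one list of training rows."""
    return list(read_clips(nynorsk_tsv, "nynorsk")) + list(
        read_clips(bokmaal_tsv, "bokmål")
    )


# Made-up sample data standing in for the two real metadata files.
nynorsk_tsv = "path\tsentence\nclip_nn_1.mp3\tEin blir sterkt gripen av personane og livet deira.\n"
bokmaal_tsv = "path\tsentence\nclip_nb_1.mp3\tEn blir sterkt grepet av personene og livet deres.\n"
rows = combine(nynorsk_tsv, bokmaal_tsv)
```

Keeping the written form as a column means whoever trains a model can still decide whether to filter, balance, or mix the two.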


I have also recently noticed the new variant field. I was actually wondering what it was for the last time I looked through the .tsv files :grinning:.

I am unsure if they could be used to facilitate an alternative written language for a whole dataset though.

Currently, the variant field in the .tsv files is for the spoken language (voice corpus), and comes from the demographic info set in user profiles (if any). Variants were introduced in 2022, and recently a new call went out to add more. You can see the currently available ones all together here.

The new PRs I mentioned above are for sentence variants. If you put many variants into a single dataset, people tend to vote NO on sentences written in another variant, as with the -ise / -ize cases in English, or Anatolian Turkish vs. Cypriot Turkish, where the spelling and meaning of words might differ.


One question remains: If you put two scripts into a single dataset and train on all of them, what will an ASR model output?

  • The one with more data?
  • Mixed?
  • Should they be treated like multi-lingual models?
  • Transliteration?

Hi (I know Norwegian and understand the language situation in Norway well, but I will write in English so that other people can follow)

In my opinion this is a good idea. The reason Nynorsk started sooner is that there was more community movement behind it.

We are currently adding sentence variants (so sentences could be labelled e.g. no-nynorsk, no-bokmaal).

The main current issue is decoupling localisation. If you are interested, we could discuss potential ways forward on Matrix. Participants from the Nynorsk community should also be involved.

It is also supposed to support having different writing systems and orthographies.
