There is only one Norwegian spoken language…
I don’t understand why the decision to split Norwegian into different written forms was made.
If the purpose of the Common Voice project is
“building an open-source, multi-language dataset of voices that anyone can use to train speech-enabled applications,”
then this only makes sense if one wants to:
Make speech-enabled applications that work for either 10% or 90% of the Norwegian population, divided according to whichever written language each individual contributor happens to prefer.
The decision to split Norwegian by written form creates a weird, arbitrary split, because it is up to the speaker whether they use Norwegian Nynorsk or Norwegian Bokmål as their written form. Two people from the same small village, with the same dialect, could prefer different written forms.
It’s a choice of the speaker.
Sure, there is a strong correlation between a speaker's chosen written language and their dialect, but it is just that: a correlation. The written form does not determine how someone speaks. And regardless, it is the same spoken language.
This just creates two sub-optimal datasets for the Norwegian spoken language based on the choice of written language of the contributors.
Why? Who needs that? How is that relevant for creating speech-enabled applications for Norwegian speakers?
What was the reason for splitting the language into two datasets in the first place?
Is there something that I am not understanding here? I feel like my brain is melting.
I will argue that for handling the Norwegian language in the Mozilla Common Voice project it only makes sense to have a single dataset, as it is a single SPOKEN language.
That means one dataset that includes all Norwegian speakers, regardless of written form, where each sentence has both a Nynorsk and a Bokmål version:
{
  "sentence_id_key": {
    "norwegian_nynorsk": "Ein blir sterkt gripen av personane og livet deira.",
    "norwegian_bokmål": "En blir sterkt grepet av personene og livet deres."
  }
}
When creating new sentences for the merged corpus, users should choose which written form they want to add the sentence in.
Recorders should choose their preferred written form, Norwegian Nynorsk or Bokmål, and be shown the corresponding form. Validators should see both versions.
This approach would create one dataset that covers all Norwegian speakers regardless of dialect while maintaining the relation of written form to audio recording.
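To make the recorder and validator flow a bit more concrete, here is a rough Python sketch of how I imagine it. None of this is taken from the actual Common Voice code: `Clip`, `prompt_for_recorder` and `prompt_for_validator` are names I made up for illustration, and the corpus is just the sentence pair structure from above held in a plain dict. The point is only that each clip remembers which written form the speaker actually read.

```python
from dataclasses import dataclass

# Keys taken from the sentence pair example above.
NYNORSK = "norwegian_nynorsk"
BOKMAL = "norwegian_bokmål"


@dataclass(frozen=True)
class Clip:
    """Hypothetical record for one recording (not the real Common Voice schema)."""
    clip_id: str
    sentence_id: str
    shown_form: str  # the written form the recorder actually read aloud


def prompt_for_recorder(corpus, sentence_id, preferred_form):
    """Recorders only see the sentence in their preferred written form."""
    return corpus[sentence_id][preferred_form]


def prompt_for_validator(corpus, clip):
    """Validators see both written forms, plus which one was read."""
    pair = corpus[clip.sentence_id]
    return {
        "read_as": clip.shown_form,
        NYNORSK: pair[NYNORSK],
        BOKMAL: pair[BOKMAL],
    }


corpus = {
    "sentence_id_key": {
        NYNORSK: "Ein blir sterkt gripen av personane og livet deira.",
        BOKMAL: "En blir sterkt grepet av personene og livet deres.",
    }
}

# A speaker who prefers Nynorsk records the sentence; the clip keeps that link.
text_shown = prompt_for_recorder(corpus, "sentence_id_key", NYNORSK)
clip = Clip(clip_id="clip-0001", sentence_id="sentence_id_key", shown_form=NYNORSK)
validator_view = prompt_for_validator(corpus, clip)
```

Because the clip stores `shown_form`, anyone who really wants a Nynorsk-only or Bokmål-only subset could still filter for it later, but the default would be one dataset for the whole spoken language.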
This could be done without losing the progress already made on the Norwegian Nynorsk and Norwegian Bokmål datasets: merge them and turn each existing sentence into a sentence pair.
I am also a software developer and would be happy to help if such a merge becomes relevant.
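A merge along those lines could look roughly like the sketch below. This is only my assumption of how it might work, not a description of how the Common Voice datasets are actually stored: I pretend each existing dataset is a dict of sentence id to text, I prefix the ids with `nn-` and `nb-` (my own convention) so they cannot collide, and I leave the other written form as `None` until someone adds it. The function names are made up too.

```python
# Keys taken from the sentence pair example above.
NYNORSK = "norwegian_nynorsk"
BOKMAL = "norwegian_bokmål"


def merge_corpora(nynorsk_sentences, bokmal_sentences):
    """Turn every existing sentence into a sentence pair.

    The counterpart in the other written form is left as None,
    to be filled in later by contributors.
    """
    merged = {}
    for sentence_id, text in nynorsk_sentences.items():
        merged[f"nn-{sentence_id}"] = {NYNORSK: text, BOKMAL: None}
    for sentence_id, text in bokmal_sentences.items():
        merged[f"nb-{sentence_id}"] = {NYNORSK: None, BOKMAL: text}
    return merged


def missing_counterparts(merged):
    """List the ids of sentence pairs that still lack one written form."""
    return [sid for sid, pair in merged.items() if None in pair.values()]


# Placeholder texts; the real datasets would supply the sentences.
existing_nynorsk = {"1": "(an existing Nynorsk sentence)"}
existing_bokmal = {"1": "(an existing Bokmål sentence)"}

merged = merge_corpora(existing_nynorsk, existing_bokmal)
todo = missing_counterparts(merged)
```

Existing recordings would keep pointing at their original sentence, now under the prefixed id, so no audio would need to be re-recorded. The `todo` list is simply the translation work that remains.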
Much love
Eskil