Norwegian speech and its dialect and sentence problems

odinho · January 14, 2019, 1:54pm

Hi! I’d like to add data for Norwegian speech recognition. Specifically to help with recognising dialects (how people actually speak).

I’ve been collecting sentences and have 500 so far.

Sentences will need to be standardized and written in one form. That will be a huge problem.

Infinitiv a/e

Even the Norwegian translation of voice.mozilla.org currently mixes two forms for the simplest of rules. It has both å lage and å laga for “to make”.

I want to use a-infinitiv (å laga), because it is further from Danish-Norwegian and will therefore be simpler to disambiguate.
A downside is that most school textbooks and the state use e-infinitiv, as far as I’ve seen.

Different words for the same thing

To be can be either å verta or å bli. People say it differently based on their dialect. Both are allowed in writing (sadly). There are two ways to go here:

1. Demand that all written text always uses one of them, and people just say whatever they say in their dialect.
2. Get both in, and have speakers speak it as written.

The problem with number 2 is that it will be unclear to the speaker when they should speak as written, and when they should not. As an example, “to see” is written “å sjå”. But if we have a rule of “say it like it’s written” for “å verta” and “å bli” (and similar pairs), it will be hard for a speaker who says “å se” in her dialect to know that she should not say “å sjå” in this particular case. She would have to know that “å se” is not a valid way to write “to see”.

The problem with number 1 is that we’d need to decide on it. Also, saying “jei blir gla’” and getting out “eg vert glad” is quite a stretch from “what you hear” to “what you get”. However, English has a lot of these weird things, so the model might be able to learn it.
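A minimal sketch of what option 1 could look like when preparing the sentence list, assuming a hand-maintained table of accepted variants and the one form the corpus settles on. The table name `CANONICAL`, the function names, and the choice of canonical forms are all hypothetical; the two entries are taken from the examples in this post and only cover the infinitives.

```python
import re

# Hypothetical variant table: accepted spelling -> the single form the
# corpus standardizes on. Only lowercase infinitives are listed here;
# inflected forms would each need their own entry.
CANONICAL = {
    "lage": "laga",   # e-infinitiv -> a-infinitiv
    "bli": "verta",   # two words for the same thing -> pick one
}


def find_variants(sentence: str) -> list[str]:
    """Return the words in the sentence that have a preferred other form."""
    return [w for w in re.findall(r"\w+", sentence) if w in CANONICAL]


def normalize_sentence(sentence: str) -> str:
    """Rewrite every listed variant to its canonical written form.

    Words not in the table (including capitalized ones) are left untouched.
    """
    return re.sub(r"\w+", lambda m: CANONICAL.get(m.group(0), m.group(0)), sentence)
```

With this table, `normalize_sentence("å lage mat")` gives `"å laga mat"`, and `find_variants` can be used to flag collected sentences for manual review instead of rewriting them automatically.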

Marking of dialect in profile

Before we actually open up to recording, we need to have dialect marking in the profile. It also needs to be in the exported data.
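As a rough illustration of carrying the dialect through to the exported data, here is a sketch that writes one row per clip to a tab-separated file. The `Clip` record, its field names, and the `export_clips` helper are assumptions made for this example and do not describe the actual export format.

```python
import csv
from dataclasses import asdict, dataclass, fields
from typing import Iterable, TextIO


@dataclass
class Clip:
    """Hypothetical export record; field names are illustrative only."""
    clip_path: str
    sentence: str          # the sentence as shown to the speaker
    written_language: str  # e.g. "nn" for Nynorsk, "nb" for Bokmål
    dialect: str           # self-reported dialect from the speaker's profile


def export_clips(clips: Iterable[Clip], out: TextIO) -> None:
    """Write a header plus one tab-separated row per clip."""
    names = [f.name for f in fields(Clip)]
    writer = csv.DictWriter(out, fieldnames=names, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for clip in clips:
        writer.writerow(asdict(clip))
```

Keeping the dialect as a column on every row means a model trainer can filter or balance by dialect without needing access to the speaker profiles.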

What do you think? Do any other languages have similar problems? My impression is that most countries have a more widely accepted standard language. Foreigners will often treat “Standard austnorsk” as that standard for Norwegian, but that’s really not true.

mkohler · July 30, 2018, 4:47pm

I’m not sure I understand the problem here. The sentences will be shown to the speaker as they are; there should be no translation whatsoever. IMHO the corpus data should just have enough of each dialect so the model would understand all the different ways of saying something. Or am I missing something here?

odinho · July 30, 2018, 5:14pm

You are probably missing the fact that we have two competing written languages.

The recognizer should not write dialect; it should write Norwegian Nynorsk. So the sentences will have to be in Norwegian though the speech will be in dialect. The two written languages are Danish-Norwegian (Bokmål, “book language”) and Norwegian (Nynorsk, “new Norwegian”). Bokmål is the most used one. It is also getting all the commercial interest. I am talking about Nynorsk here. The speech data could probably be used to train a model that writes Bokmål too; one would only need to translate the written sentences.

In Danish-Norwegian “to look” is written as “å se” and nothing else is allowed.
In Norwegian it is written as “å sjå” and nothing else is allowed.
E.g. in my dialect I would say “å se”.

So the problem would be confusion for the reader.

The Norwegian Speech-To-Text should in other words write “å sjå” if I say “å se”. The problem is that when I read a sentence “eg likar å sjå TV”, I might actually SAY [eg lige å sjå teve] (which isn’t correct per my dialect) instead of [eg lige å se teve].

I think this problem is unique to Norwegian: two written languages and, to boot, hundreds of dialects. I think there’s a good reason commercial Speech-To-Text wants to stay well clear of dialects.

Thanks a lot for the reply! I’ve been bouncing between solution 1 and solution 2 for different words of the same meaning. Though I guess going with solution 2 and putting some more complexity on the reader and verifier isn’t too bad and might be the best way overall.

mkohler · July 30, 2018, 5:36pm

Thanks for the explanation! I’ll let @mhenretty chime in here

mhenretty · July 31, 2018, 9:58am

Thanks for the thoughtful question @odinho, and for the discussion @mkohler!

I am inclined to agree with you, but honestly I rely on local language experts like you to advise us rather than the other way around. The biggest thing is that we need to be more explicit about what is a “valid” utterance vs. “invalid”. That’s something we are discussing here:

odinho · July 31, 2018, 3:38pm

It’s my mother tongue, but I’m far from an expert. I’ve asked some actual experts too. I’ll continue working on it, and keep collecting sentences.

Mittineague · July 31, 2018, 8:14pm

From what I can see, so far there is only
language-REGION
when what is needed is
language-REGION-Variant

Or would even that be insufficient?
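To make the shape of such a tag concrete, here is a small sketch that splits a `language-REGION-Variant` string into its parts. It is not a full BCP 47 parser, and `LocaleTag`, `parse_tag`, and the variant value `somevariant` are hypothetical names used only for illustration.

```python
from typing import NamedTuple, Optional


class LocaleTag(NamedTuple):
    language: str
    region: Optional[str]
    variant: Optional[str]


def parse_tag(tag: str) -> LocaleTag:
    """Split 'language[-REGION[-variant]]' into its components.

    Simplified on purpose: scripts, extensions and private-use subtags
    that real language tags allow are not handled.
    """
    parts = tag.split("-")
    if len(parts) > 3 or not all(parts):
        raise ValueError(f"unsupported tag: {tag!r}")
    language = parts[0].lower()
    region = parts[1].upper() if len(parts) > 1 else None
    variant = parts[2].lower() if len(parts) > 2 else None
    return LocaleTag(language, region, variant)
```

For example, `parse_tag("nn-NO")` yields a tag with no variant, while `parse_tag("nn-NO-somevariant")` fills the third slot. Whether a single variant slot is enough for hundreds of dialects is exactly the open question above.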

mhenretty · August 1, 2018, 10:40am

We are deciding this on a case-by-case basis. The decision usually hinges upon how different regional dialects are, and especially how they are written. We have not yet explored variants.