Dialects with Deep Speech

I want to build a speech recognition model that can transcribe dialects, starting with Bavarian (a German dialect).
I played around a bit and noticed that DeepSpeech has a hard time recognising abbreviated words.
E.g. there is “a” in the sense of “ein” (one/a), and “a” in the sense of “auch” (also/too). Both sound exactly the same.
Can DeepSpeech determine from context whether it needs “ein” or “auch”, or should I first let it transcribe to the literal text “a” and build a secondary translation model that takes the Bavarian text and turns it into proper German?
By my current understanding I need the second approach, as DeepSpeech only outputs single characters and then tries to assign words by searching all known words for the most likely one. In that case, any recommendations for something as easy to use for translation as DeepSpeech is for speech-to-text?
Thanks in advance.

You are right, DeepSpeech “guesses” the letters that are spoken. So you would need to translate “auf die Nacht” to “abends” afterwards.
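A minimal sketch of what such an after-the-fact translation step could look like: a longest-match phrase table mapping transcribed Bavarian phrases to standard German. Apart from “auf die Nacht” → “abends” from this thread, the table entries are made-up placeholders, and a real system would need something context-aware (a dictionary alone cannot decide between “ein” and “auch” for a bare “a”).

```python
# Toy post-processing sketch: map transcribed Bavarian phrases to
# standard German via a longest-match phrase table. Entries besides
# "auf die Nacht" -> "abends" are placeholders for illustration.
PHRASE_TABLE = {
    ("auf", "die", "nacht"): ("abends",),
    ("i",): ("ich",),
}

def translate(sentence: str) -> str:
    words = sentence.lower().split()
    out, i = [], 0
    max_len = max(len(key) for key in PHRASE_TABLE)
    while i < len(words):
        for n in range(max_len, 0, -1):  # try the longest phrase first
            key = tuple(words[i:i + n])
            if key in PHRASE_TABLE:
                out.extend(PHRASE_TABLE[key])
                i += n
                break
        else:  # no phrase matched: keep the word as-is
            out.append(words[i])
            i += 1
    return " ".join(out)

print(translate("i geh auf die Nacht heim"))  # -> "ich geh abends heim"
```

Note that this only handles phrases it can match literally; the ambiguous “a” case needs context, which is exactly why a language model or a trained translation model comes in.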

You would need a fairly large corpus of translations to train something like that, and as far as I know there is none.

I had a similar discussion about the Alemannic German dialects. Since German Common Voice only allows German accents, not dialects, you would have to start collecting Bavarian as a new language in Common Voice. You could use the Bavarian Wikipedia, with over 30 000 articles, as a base for the sentences.

But as far as I understand you, you want to use the German model to transcribe Bavarian. This is an interesting idea; I have no idea how well it would work.

EDIT: maybe you could use the German network and only use the Bavarian Wikipedia as the base for the language model.
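For context, DeepSpeech’s external scorer is an n-gram language model built from plain text with KenLM, so the idea above amounts to feeding a cleaned, one-sentence-per-line dump of the Bavarian Wikipedia into that pipeline. The snippet below is only a rough sketch of the kind of counting such an n-gram model is based on, using a tiny stand-in corpus; it is not the actual KenLM tooling.

```python
from collections import Counter

# Sketch of the counting behind an n-gram language model. In practice
# you would clean the Bavarian Wikipedia dump to one sentence per line
# and build a proper smoothed model with KenLM for DeepSpeech's scorer;
# the sentences here are stand-ins, not real corpus data.
corpus = [
    "i mog di",
    "i mog a bier",
    "a bier mog i a",
]

unigrams, bigrams = Counter(), Counter()
for line in corpus:
    words = line.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def bigram_prob(w1: str, w2: str) -> float:
    """Maximum-likelihood P(w2 | w1); a real LM adds smoothing/backoff."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

print(bigram_prob("i", "mog"))
```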

Thank you for the suggestion. I took a look at the Bavarian Wikipedia but found that it is mostly in a different sub-dialect (Upper Palatinate) and thus of little use to me.
I will start by collecting conversations of mine in my native dialect (Lower Bavarian).

@othiele I found that https://opennmt.net/ is also quite straight forward (like DeepSpeech)

Dialects are also important for us with te reo Māori. We haven’t started to tackle the problem, but many people have asked us about it and we’ve thought about it. For example, should we train a base model (te reo Māori) and then use transfer learning to train dialect-specific models (because we don’t have as much dialect data compared with “general” te reo Māori)? But then a user would need to select their dialect. There are some consistent rules for translating between dialects, so translating the text could be straightforward.

Or can you train a single model that knows multiple dialects? One example for us is the consonant “wh”, which has an English “f” sound in most dialects, but in the Far North it’s pronounced more like an English “h”. Not surprisingly, DeepSpeech recognises both pronunciations, I think partly because of the language model but also because we have both pronunciations in our corpus.

Definitely keen to hear how you go and what you learn along the way as I’m sure it’ll be useful for many other languages.

I’m working on a model for the Kabyle language. My results show problems similar to yours, due to the dialect differences. But when it comes to phonetic ambiguities, this can be partially remedied with a good language model.
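To illustrate how a language model can resolve such ambiguities: when the acoustic model hears Bavarian “a”, the surrounding words can decide between “ein” and “auch”. The bigram probabilities below are made up purely for the demo; a real decoder would use KenLM scores inside the beam search rather than a hand-written table.

```python
# Toy homophone disambiguation: the acoustic model hears "a", which may
# mean "ein" or "auch"; context decides. Probabilities are invented for
# this demo, not taken from any real language model.
BIGRAM_P = {
    ("mog", "ein"): 0.05, ("ein", "bier"): 0.30,
    ("mog", "auch"): 0.20, ("auch", "bier"): 0.01,
}

def score(words):
    p = 1.0
    for pair in zip(words, words[1:]):
        p *= BIGRAM_P.get(pair, 1e-4)  # small floor for unseen bigrams
    return p

def disambiguate(left, candidates, right):
    """Pick the candidate whose context window scores highest."""
    return max(candidates, key=lambda w: score([left, w, right]))

print(disambiguate("mog", ["ein", "auch"], "bier"))  # -> "ein"
```

“mog ein bier” scores 0.05 × 0.30 = 0.015 versus 0.20 × 0.01 = 0.002 for “mog auch bier”, so the context picks “ein” even though both sound identical.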

POS tagging can also help for post-processing.

Interesting, how would you do that in detail?

The post-processing (for Kabyle) is applied after decoding. Kabyle syntax uses “-” to separate affixes (prefixes and suffixes) from words, but some affixes are identical to some words (ambiguity). As the “-” has no vocal representation, it is not worth keeping before training (we train without it, as pre-processing), so the generated sentence is syntactically wrong. We apply this post-processing to reconstruct the correct syntax using a POS-tag model for Kabyle, which also handles word disambiguation.
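A hypothetical sketch of that reconstruction step: since training strips the silent “-”, the decoder emits the affix and word fused together, and post-processing splits them back apart against an affix inventory. The affix list and words below are placeholders, not real Kabyle data, and this simple lexicon lookup glosses over exactly the affix/word ambiguity that the actual POS-tag model is there to resolve.

```python
# Hypothetical sketch of re-inserting the "-" affix separator after
# decoding. PREFIXES and VOCAB are placeholder data, not real Kabyle;
# the real system uses a POS-tag model, since a plain lexicon lookup
# cannot resolve cases where an affix is also a standalone word.
PREFIXES = {"d", "n"}
VOCAB = {"argaz", "axxam"}

def restore_hyphens(word: str) -> str:
    for p in sorted(PREFIXES, key=len, reverse=True):
        rest = word[len(p):]
        if word.startswith(p) and rest in VOCAB:
            return f"{p}-{rest}"
    return word  # no affix+word split found: leave unchanged

print(" ".join(restore_hyphens(w) for w in "dargaz axxam".split()))
```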


I’ve seen a project recently which does something similar, transcribing Swiss German dialect to standard German. They only had a small audio corpus and used transfer learning with a German model. The language model was generated from standard German texts. It’s here:
