Phoneme Recognition - LM worsens results

I’d like to use DeepSpeech for online phoneme recognition. Forced alignment is not an option.

As a phoneme set, I’m using X-Sampa, where each phoneme consists of either 1 or 2 characters. The blank " " is intentionally included so that each phoneme is treated as a word and proper time stamps can be obtained.
The mapping looks like this:

{61: } {60: 6} {59: OI} {58: n} {57: n=} {56: I@} {55: dZ} {54: u:} {53: l=} {52: N=} {51: E} {50: e@} {49: f} {48: 3`} {47: g} {46: e} {45: @U} {44: j} {43: i:} {42: d} {41: A:} {40: D} {39: FILPAUSE} {38: 3:} {37: t} {36: T} {15: u} {14: S} {13: U@} {12: ?} {11: O:} {10: w} {9: l} {8: h-} {7: I} {6: 4} {1: {} {0: aU} {2: v} {3: h} {4: V} {5: m=} {16: s} {17: SILPAUSE} {18: k} {19: m} {20: eI} {21: r} {22: p} {23:  } {24: h\} {25: NOISE} {26: Q} {27: @} {28: aI} {29: Z} {30: U} {31: z} {32: R} {33: b} {34: tS} {35: N}
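
For reference, here’s a minimal sketch (not DeepSpeech’s actual code) of how this mapping relates to alphabet.txt: each line of the file is one label and its 0-based line number is the integer class, so in my case the lines are X-Sampa phonemes (including a line containing only a space) instead of single characters:

# Rebuild the label <-> phoneme mapping from alphabet.txt.
with open("alphabet.txt", encoding="utf-8") as f:
    label_to_phoneme = {i: line.rstrip("\n") for i, line in enumerate(f)}

phoneme_to_label = {p: i for i, p in label_to_phoneme.items()}
print(label_to_phoneme[0])     # 'aU' (assuming the file matches the mapping above)
print(phoneme_to_label["dZ"])  # 55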

Training on phoneme-aligned LibriSpeech and inference based solely on the acoustic model work in principle. The results become worse, however, when a phoneme-based LM is plugged in. The LM is trained accordingly on phoneme-aligned LibriSpeech in X-Sampa and should be appropriate.
When using the LM, the CTC decoder prefers 1-character phonemes, even though nothing in the LM suggests doing so and the mapping (see above) seems correct. Could anyone give me a hint as to what might be wrong?

Is the alphabet allowed to consist of multi-character strings, i.e. can a line in alphabet.txt contain more than one character?

util/check_characters.py returns only single-character strings, as its name suggests. Am I supposed to map multi-character strings to single-character strings in alphabet.txt?

I don’t think this is supposed to work.

How do you produce the LM?

For training the acoustic model based on multi-character strings in alphabet.txt, it’s sufficient to adapt this function:
https://github.com/mozilla/DeepSpeech/blob/master/util/text.py#L44
Then DeepSpeech’s internal mapping looks correct, as shown above:

{61: } {60: 6} {59: OI} {58: n} {57: n=} {56: I@} {55: dZ} {54: u:} {53: l=} {52: N=} {51: E} {50: e@} {49: f} {48: 3`} {47: g} {46: e} {45: @U} {44: j} {43: i:} {42: d} {41: A:} {40: D} {39: FILPAUSE} {38: 3:} {37: t} {36: T} {15: u} {14: S} {13: U@} {12: ?} {11: O:} {10: w} {9: l} {8: h-} {7: I} {6: 4} {1: {} {0: aU} {2: v} {3: h} {4: V} {5: m=} {16: s} {17: SILPAUSE} {18: k} {19: m} {20: eI} {21: r} {22: p} {23:  } {24: h\} {25: NOISE} {26: Q} {27: @} {28: aI} {29: Z} {30: U} {31: z} {32: R} {33: b} {34: tS} {35: N}

The decoder itself works with std::vector<int>; for the LM scorer, it converts this to a string of words (in our case phonemes, e.g. “E g N k”) using the mapping. I debugged the decoder and the words look as expected. Thus, I’m puzzled why the results get worse with the LM.
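
For illustration, roughly what the scorer ends up seeing, as a Python sketch of that C++ path (the label values are hypothetical but taken from the mapping above):

# The decoder's integer output is mapped back through the alphabet; the
# blank/space label splits it into LM "words" (phonemes such as "E g N k").
labels = [51, 23, 47, 23, 35, 23, 18]  # hypothetical decoder output
mapping = {51: "E", 23: " ", 47: "g", 35: "N", 18: "k"}
text = "".join(mapping[i] for i in labels)
print(text)          # "E g N k"
print(text.split())  # ['E', 'g', 'N', 'k']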

The language model is trained directly with kenlm, i.e. not via the Mozilla DeepSpeech helper script. In the input file for LM training, phonemes are blank-separated.

bin/lmplz -o 4 --text <transcriptionFile> --arpa <output>.arpa
bin/build_binary <output>.arpa <output>.klm
bin/build_binary trie <output>.arpa <output>.trie
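
As a sanity check (assuming the kenlm Python bindings are installed; output.klm stands in for the <output>.klm built above), the phoneme LM can be queried directly:

import kenlm

model = kenlm.Model("output.klm")  # hypothetical path
# A plausible phoneme sequence should score noticeably better than a shuffled one.
print(model.score("SILPAUSE g @U SILPAUSE d u: j u:", bos=True, eos=True))
print(model.score("u: j u: d SILPAUSE @U g SILPAUSE", bos=True, eos=True))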

Right, what does it look like?

It’s likely not the only place. We have some code around in native_client/ctcdecoder that deals with the Alphabet.

You might need to check native_client/alphabet.h as well as how the decoder in native_client uses the Alphabet.

I checked (i.e. debugged and printed) that as well.
In alphabet.h, the private members label_to_str_ and str_to_label_ of Alphabet are set in the deserialize method (https://github.com/mozilla/DeepSpeech/blob/master/native_client/alphabet.h#L48). That is, they were stored during training and are now extracted from output_graph.pb. They both look as desired. Actually, the mapping I keep pasting comes from a print statement in alphabet.h.

The input file for the LM training looks like this (blank-separated phonemes). Snippet:

SILPAUSE g @U SILPAUSE d u: j u: h i: r SILPAUSE b V t I n l e s D V n f aI v m I n V t s SILPAUSE D V s t e r k eI s g r @U n d b I n i: T V n I k s t r O: r d V n e r i: w eI t SILPAUSE { t D I s m @U m V n t SILPAUSE D V h @U l s @U l V v D i: @U l d m { n s i: m d s e n t 3: d I n h I z aI z w I tS b I k eI m b l V d S Q t SILPAUSE D V v eI n z V v D V T r @U t s w e l d SILPAUSE h I z tS i: k s V n d t e m p V l z b i: k eI m p 3: p V l e z D @U h i: w V z s t r V k w I T e p V l e p s i: SILPAUSE

Snippet of the .arpa file (1-grams):

-1.5840585      d       -1.295391
-1.6198325      u:      -1.1033778
-1.5956589      j       -1.2067125
-1.5956589      h       -1.25041
-1.5840585      i:      -1.2728182

Both look fine to me.

Could you share some examples of expected vs. actual results?

Could it just be the way the scorer works, and would tuning alpha and beta help?

I just want to make sure that nothing is tripping up here and that the whole training and inference code properly picks up your phoneme-based setup.
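
As a minimal sketch of the kind of sweep I mean (phoneme_error_rate is a hypothetical helper that decodes a dev set with the given weights and returns the error rate; it is not part of DeepSpeech):

import itertools

def tune_lm_weights(phoneme_error_rate,
                    alphas=(0.0, 0.25, 0.5, 0.75, 1.0),
                    betas=(0.0, 0.5, 1.0, 1.85)):
    # Brute-force grid search over the LM weight and the word-insertion bonus.
    results = {(a, b): phoneme_error_rate(lm_alpha=a, lm_beta=b)
               for a, b in itertools.product(alphas, betas)}
    return min(results, key=results.get)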

Here’s a sample comparison of ground truth vs. predictions:

  • Audio: Librispeech/dev-clean/84/121123/84-121123-0000.wav
  • Grapheme transcription: “Go, do you hear”

In Phonemes:

Ground-Truth:

['SILPAUSE', 'g', '@U', 'SILPAUSE', 'd', 'u:', 'j', 'u:', 'h', 'i:', 'R', 'SILPAUSE']

Without LM:

['SILPAUSE', 'g', 'R', '@U', 'p', 'SILPAUSE', 'd', 'u:', 'h', 'i:', 'R', 'SILPAUSE']

With LM (default values: lm_alpha=0.75, lm_beta=1.85):

  • the decoder only picks 1-character phonemes
  • it tries to find similar-sounding phonemes instead, e.g. h for SILPAUSE

['h', 'g', 'V', 'p', 'd', 'I', 'h', 'I']

Strangely, setting lm_alpha=0 and lm_beta=0 produces the same result.
With LM (lm_alpha=0 and lm_beta=0):

['h', 'g', 'V', 'p', 'd', 'I', 'h', 'I']

I cannot explain the last result even after debugging (I also double-checked that the LM weights are set as described).
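
For reference, here’s a minimal sketch of how the combined score is usually described for CTC beam search with an external LM (not DeepSpeech’s exact C++ code); with lm_alpha = lm_beta = 0 the LM terms should vanish, which is why the identical output above puzzles me:

def combined_score(log_p_ctc, log_p_lm, word_count, lm_alpha=0.75, lm_beta=1.85):
    # Typical prefix scoring: acoustic log-probability plus weighted LM
    # log-probability plus a word-count bonus.
    return log_p_ctc + lm_alpha * log_p_lm + lm_beta * word_count

# With both weights at zero, only the acoustic term remains:
print(combined_score(-12.3, -4.1, 5, lm_alpha=0.0, lm_beta=0.0))  # -12.3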

Could it just be the way the scorer works, and would tuning alpha and beta help?

Probably the issue with the last result (LM with lm_alpha=0, …) needs to be fixed first in order to find that out.

I just want to make sure that nothing is tripping up here and that the whole training and inference code properly picks up your phoneme-based setup.

Sure. Thanks for your help so far!

UPDATE:
I mapped the 2-character phonemes to 1-character code points for training DeepSpeech and the LM. The results now improve when plugging in the LM.
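
For anyone trying the same workaround, here’s a minimal sketch of such a 1-to-1 remapping (the phoneme list is abridged; the Private Use Area codepoints are just one possible choice):

phonemes = ["aU", "@U", "i:", "u:", "dZ", "tS", "eI", "aI", "SILPAUSE", "FILPAUSE"]

to_codepoint = {p: chr(0xE000 + i) for i, p in enumerate(phonemes)}
from_codepoint = {c: p for p, c in to_codepoint.items()}

# Encode transcripts (and the LM corpus) as single codepoints, decode afterwards.
encoded = "".join(to_codepoint[p] for p in ["aU", "dZ", "i:"])
print([from_codepoint[c] for c in encoded])  # ['aU', 'dZ', 'i:']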

The question remains why the 2-character strings in “alphabet.txt” don’t work and which line of code needs to be changed.

Ideas are welcome!

I don’t remember the specifics well, but I think it should work, as it does for some languages. Maybe @reuben remembers better than me?

Classes with two codepoints don’t work because the training code iterates over Unicode codepoints in the transcript strings in the CSVs. So if you have a transcript like :u it’ll be converted into two different classes, one for : and one for u. You should make up your own single codepoint mapping (that can then be converted back into whatever you want), as you’ve already discovered.

Multi-character prediction classes can lead to tricky cases, for example what happens when you have overlapping classes like :u and a: and a transcript that’s a:u. You can make individual implementation choices for each case, but then you have to document them, make sure they don’t interact badly with any other components, etc.

The relevant code is in util/text.py, in the Alphabet.encode method. For now it’s a simple for codepoint in transcript: iteration.
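
For illustration, a greedy longest-match variant of such an encode might look like this (a sketch, not the actual DeepSpeech code); it also makes the overlap problem above concrete, since for “a:u” it commits to “a:” and then needs “u” to be a label of its own:

def encode(transcript, labels):
    # Map each label string to its class index; match greedily, longest first.
    str_to_label = {s: i for i, s in enumerate(labels)}
    max_len = max(len(s) for s in labels)
    out, pos = [], 0
    while pos < len(transcript):
        for length in range(min(max_len, len(transcript) - pos), 0, -1):
            candidate = transcript[pos:pos + length]
            if candidate in str_to_label:
                out.append(str_to_label[candidate])
                pos += length
                break
        else:
            raise ValueError(f"No label matches transcript at position {pos}")
    return out

print(encode("a:u", ["a:", ":u", "u", "a"]))  # [0, 2] -> "a:" then "u"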

@reuben Phoneme codepoints are included in Unicode, see for example here. So would it not be possible to use these and the UTF-8 support you recently added?

The point is that each label/class has to be a single codepoint. Not all of those phonemes are a single codepoint, for example β̞̊ which is composed of three codepoints: https://apps.timwhitlock.info/unicode/inspect?s=β̞̊

Using UTF-8 mode would also be a way to solve this without having to make up your own translation table.
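
A quick way to check this from Python (the escape sequence assumes the three codepoints reported by the linked inspector):

import unicodedata

s = "\u03B2\u031E\u030A"
print(len(s))                           # 3 codepoints, but one phoneme symbol
print([unicodedata.name(c) for c in s])
# ['GREEK SMALL LETTER BETA', 'COMBINING DOWN TACK BELOW', 'COMBINING RING ABOVE']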

AFAIK, X-Sampa maps one-to-one and onto IPA, which has a single Unicode codepoint for each character. So “pure” X-Sampa, I’d guess, is fine.

The problem, if I understand it, is diacritics, which take the single codepoint [t] to [tʰ], which consists of multiple codepoints.

That’s unfortunate as it means any transfer learning we might do with phonemes, assuming we use diacritics, is going to have to be a bit of a hack.

Not every diacritic, as some of them have normalized forms, like é. In the case of [tʰ] it’s also two extended grapheme clusters (see https://repl.it/repls/InnocentRepulsiveFile ), so that’s also a case where even using Unicode segmentation algorithms wouldn’t be enough.
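
A small Python check of that difference (assuming the aspiration mark is U+02B0 MODIFIER LETTER SMALL H):

import unicodedata

# "é" written as e + COMBINING ACUTE ACCENT has a precomposed NFC form:
decomposed = "e\u0301"
print(len(decomposed), len(unicodedata.normalize("NFC", decomposed)))  # 2 1

# Aspirated t has no precomposed form and stays two codepoints
# (and, since U+02B0 is not a combining mark, two grapheme clusters):
aspirated = "t\u02B0"
print(len(unicodedata.normalize("NFC", aspirated)))  # 2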
