I’d like to use DeepSpeech for online phoneme recognition. Forced alignment is not an option. As the phoneme set I’m using X-SAMPA, where each phoneme consists of either 1 or 2 characters. The blank " " is included intentionally so that each phoneme is treated as a word and gets proper timestamps. The mapping looks like this:
{0: aU} {1: {} {2: v} {3: h} {4: V} {5: m=} {6: 4} {7: I} {8: h-} {9: l} {10: w} {11: O:} {12: ?} {13: U@} {14: S} {15: u} {16: s} {17: SILPAUSE} {18: k} {19: m} {20: eI} {21: r} {22: p} {23: } {24: h\} {25: NOISE} {26: Q} {27: @} {28: aI} {29: Z} {30: U} {31: z} {32: R} {33: b} {34: tS} {35: N} {36: T} {37: t} {38: 3:} {39: FILPAUSE} {40: D} {41: A:} {42: d} {43: i:} {44: j} {45: @U} {46: e} {47: g} {48: 3`} {49: f} {50: e@} {51: E} {52: N=} {53: l=} {54: u:} {55: dZ} {56: I@} {57: n=} {58: n} {59: OI} {60: 6} {61: }
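For clarity, this is roughly how the mapping is used; a minimal sketch (the function and the abbreviated dict are illustrative, not actual DeepSpeech API):

```python
# Minimal sketch: turn output indices into X-SAMPA "words".
# The dict is abbreviated; the full mapping is listed above
# (indices 23 and 61 map to blanks).
ALPHABET = {
    0: "aU", 7: "I", 14: "S", 19: "m", 23: " ",
    28: "aI", 34: "tS", 37: "t",
    # ... remaining entries exactly as in the mapping above
}

def indices_to_transcript(indices):
    # Joining every phoneme label with " " makes each phoneme its own
    # word, so the decoder produces per-phoneme timestamps.
    return " ".join(ALPHABET[i] for i in indices if ALPHABET[i] != " ")

print(indices_to_transcript([37, 28, 19]))  # -> "t aI m" (/taIm/, "time")
```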
Training on phoneme-aligned LibriSpeech and inference based solely on the acoustic model work in principle. The results get worse, however, when a phoneme-based LM is plugged in. The LM is trained accordingly on phoneme-aligned LibriSpeech transcripts in X-SAMPA and should be appropriate.
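The LM is a KenLM model (which is what the DeepSpeech decoder expects); roughly, it is built like this, a sketch with placeholder paths:

```python
# Sketch: build a phoneme 3-gram LM with KenLM from a corpus where each
# line is one utterance as space-separated X-SAMPA tokens,
# e.g. "D @ k { t" for "the cat".
import subprocess

with open("phoneme_corpus.txt") as corpus, open("phoneme_lm.arpa", "w") as arpa:
    subprocess.run(
        ["lmplz", "-o", "3", "--discount_fallback"],  # fallback helps with the tiny vocabulary
        stdin=corpus, stdout=arpa, check=True,
    )

subprocess.run(["build_binary", "phoneme_lm.arpa", "phoneme_lm.binary"], check=True)
```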
When the LM is used, the CTC decoder prefers 1-character phonemes, even though nothing in the LM should encourage this and the mapping (see above) appears correct. Could anyone give me a hint as to what might be wrong?
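To rule out the LM itself, it can be queried directly, comparing a 2-character phoneme against its 1-character split. A sketch, assuming the Python kenlm module and the placeholder model path from above:

```python
# Check whether the LM itself already scores 1-character splits higher.
# "i: tS" is "each" in X-SAMPA; "i: t S" is the same word with tS wrongly
# split into the 1-character phonemes t and S.
import kenlm

lm = kenlm.Model("phoneme_lm.binary")
for sentence in ["i: tS", "i: t S"]:
    print(sentence, "->", lm.score(sentence, bos=True, eos=True))
```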
- OS Platform and Distribution: Linux Ubuntu 18.04
- followed instructions: https://github.com/mozilla/DeepSpeech/tree/master/native_client
- Bazel version (if compiling from source): 0.24.1