Labeling with phonemes rather than letters

I’ve searched through this forum but haven’t found any documentation on why DeepSpeech uses letters as labels rather than phonemes. Is there any practical reason for this?
My team at university and I are building our own spoken corpora, and we wonder whether there would be any gain from using phonemes instead of letters. Would it require a lot of pre/post-processing, or can DeepSpeech be easily adapted to do so? Thank you in advance.

You could use phonemes, but then you’d always have to have a phonetic spelling for whatever language you train on. This would make the STT hurdle for something like Hakha Chin higher.

Thank you for the response. I don’t quite understand — isn’t using phoneme labels just a matter of changing alphabet.txt?

I understand that using a phoneme-based alphabet would require a lot of professional phoneticians’ labor, but how large do you think the gain would be? Phonemes better reflect the sounds people actually utter (the ‘o’ in ‘cow’ and the ‘o’ in ‘dog’ are very different sounds), and they might let us handle out-of-vocabulary words better. I’ve searched the web but haven’t found any substantial discussion of this.

Yes, changing alphabet.txt would work, assuming the data you train on also has phonetic spellings rather than orthographic transcripts.
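To make that concrete, here is a minimal sketch of the preprocessing step: converting word-level transcripts to phoneme-level ones and collecting the phoneme inventory for a new alphabet.txt. The pronunciation dictionary here is a hypothetical stand-in; in practice you would load a real lexicon such as CMUdict, and you would need to check how your DeepSpeech version expects multi-character labels in alphabet.txt.

```python
# Hypothetical toy pronunciation dictionary; a real setup would load a
# full lexicon (e.g. CMUdict) instead.
PRON_DICT = {
    "the": ["DH", "AH"],
    "cat": ["K", "AE", "T"],
    "sat": ["S", "AE", "T"],
}

def transcript_to_phonemes(transcript):
    """Replace each word with its space-separated phoneme sequence."""
    phonemes = []
    for word in transcript.lower().split():
        if word not in PRON_DICT:
            raise KeyError(f"no pronunciation for {word!r}")
        phonemes.extend(PRON_DICT[word])
    return " ".join(phonemes)

def build_alphabet(transcripts):
    """Collect the phoneme inventory (plus a space separator) for alphabet.txt."""
    symbols = {ph for t in transcripts for ph in transcript_to_phonemes(t).split()}
    return sorted(symbols) + [" "]
```

Out-of-vocabulary words are the catch: every word in the training transcripts needs a pronunciation, which is where the phoneticians’ labor (or a grapheme-to-phoneme model) comes in.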


Thanks. What do you think about the question above (the power of a phonetic alphabet)?

I’d guess phonemes would work better, but we’re not choosing to use them for now.


Isn’t there a problem with the language model if you simply convert the train/test/dev transcripts and alphabet.txt to phonemes? Given that, is there a way to skip the LM during training and decoding? Or, better, a way to convert the phoneme predictions back to words and then use the LM? I’m sorry, I’m very new to this, but I would like to build DeepSpeech from scratch and train it for German at the phoneme level, since I want to use it for lip sync of an animated character given an audio file.

You’d, of course, have to switch the language model to use phonemes too, or, alternatively, not use a language model.

Converting the phoneme predictions to words would work too, but you’d need to introduce an extra model to do so. You’d also need to make sure that this extra model can handle mistakes in the phonetic spellings it’s converting. (So, for example, a simple map would likely not be ideal.)
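One simple way to tolerate recognition mistakes, rather than relying on an exact map, is a nearest-neighbor lookup over a pronunciation lexicon using edit distance on phoneme sequences. This is only a sketch with a hypothetical toy dictionary; a real system would likely use a weighted FST or a trained phoneme-to-grapheme model instead.

```python
# Hypothetical toy lexicon mapping words to their phoneme sequences.
PRON_DICT = {
    "cat": ("K", "AE", "T"),
    "cap": ("K", "AE", "P"),
    "sat": ("S", "AE", "T"),
}

def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        cur = [i]
        for j, pb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (pa != pb)))    # substitution
        prev = cur
    return prev[-1]

def phonemes_to_word(predicted):
    """Return the word whose pronunciation is closest to the predicted sequence."""
    return min(PRON_DICT, key=lambda w: edit_distance(predicted, PRON_DICT[w]))
```

Because the lookup picks the closest entry rather than requiring an exact match, a slightly wrong prediction like S AE D still resolves to a plausible word, which a plain dictionary map could not do.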
