Including punctuation (and capitalization?) in the training text / language model?

utunga · July 5, 2020, 2:40am

Hey @lissyx, you wrote:

We’ve been chatting about this a bit over here at Te Hiku and are contemplating giving it a go. The biggest question, of course, is how to handle the KenLM/language model side of things. I am thinking you probably kept everything lowercase and just added “?”, “.”, “!”, etc. as tokens in their own right? Is that right?
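For anyone wanting to try the “punctuation as tokens” idea, here is a minimal sketch of how the language model training text might be prepared. It is only a guess at the approach, not what @lissyx actually did: the `PUNCT` set and the `tokenize_with_punct` helper are hypothetical names, and the acoustic model side (presumably the alphabet would also need these characters) is not covered.

```python
import re

# Assumption: the punctuation marks we want the language model to treat as tokens.
PUNCT = ".?!,"

def tokenize_with_punct(line: str) -> str:
    """Lowercase a line and split the chosen punctuation marks into their own tokens."""
    line = line.lower()
    # Surround each kept punctuation mark with spaces so it stands alone.
    line = re.sub("([" + re.escape(PUNCT) + "])", r" \1 ", line)
    # Collapse runs of whitespace into single spaces.
    return " ".join(line.split())

print(tokenize_with_punct("Kia ora! How are you?"))
```

Each output line could then be fed to the language model tooling as ordinary whitespace-separated text, with “?” and “!” counted as words.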

Or did you try to include capitalization as well? I know that character-level language models work well in the RNN/LSTM space for text generation. Has anyone thought of trying to integrate such models into DeepSpeech? Maybe even an encoder/decoder-based language model?

Note for others: Per the documentation, it is currently recommended to lowercase the training text and remove punctuation from it, and then perhaps use a different technique to ‘add it back in’ after DeepSpeech, based on context…
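As a rough illustration of that recommended normalization, the sketch below lowercases a line and strips punctuation. The helper name `normalize_for_lm` is hypothetical, and keeping apostrophes is an assumption; the exact character set to keep depends on your own alphabet. The Unicode normalization step is there so that macronized vowels survive as single letters.

```python
import re
import unicodedata

def normalize_for_lm(line: str) -> str:
    """Lowercase a line and remove punctuation, keeping letters, digits, and apostrophes."""
    # Compose characters (NFC) so a vowel plus combining macron becomes one letter.
    line = unicodedata.normalize("NFC", line).lower()
    # Replace anything that is not a word character (letter, digit, underscore),
    # whitespace, or an apostrophe with a space. Keeping apostrophes is an assumption.
    line = re.sub(r"[^\w\s']", " ", line)
    # Collapse runs of whitespace into single spaces.
    return " ".join(line.split())

print(normalize_for_lm("Don't stop. Really?"))
```

Punctuation and capitalization would then have to be restored by a separate step after transcription.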

//cc @tippy_top

lissyx · July 6, 2020, 9:05am

I’m sorry, but this was like a 5-minute test, and I have no memory of how I did it.

utunga · July 6, 2020, 10:48am

OK cool, no worries. Thanks though. If we try it we’ll report back.
