How should a language model dataset be prepared?

MaestusUada · May 27, 2020, 6:38pm

Hi, I’m trying to build my own model, but I’m confused about how to process the data for the language model.
Should I remove punctuation and numbers? Are there any tips and tricks for building a language model?

othiele · May 27, 2020, 7:05pm

Please search before asking; this is the second time today. Read this and come back if you have questions:

https://deepspeech.readthedocs.io/en/master/Scorer.html#building-your-own-scorer

MaestusUada · May 27, 2020, 7:19pm

I don’t know whether you read the question. I’ve collected data, and it contains punctuation and numbers.
I’m asking whether I should remove the punctuation, and whether there is anything else I should take care of before creating the model.
Could you point me to the part of the documentation you sent that answers my question?

othiele · May 27, 2020, 7:59pm

If you use the standard alphabet.txt included for English, remove all punctuation that is not in it from your input.

Convert numbers with num2words.

There are numerous ways to include punctuation and stuff, but start simple and take it from there.
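
As a rough illustration of the "start simple" approach, the cleanup could look something like the sketch below. It assumes the stock English alphabet.txt covers only a-z, space and apostrophe (check your own copy), and that the third-party num2words package is installed. The names `ALLOWED`, `spell_number` and `normalize_line` are made up for this example and are not part of DeepSpeech.

```python
import re

try:
    # Third-party package: pip install num2words
    from num2words import num2words
except ImportError:  # keep the sketch importable without it
    num2words = None

# Assumption: the stock English alphabet.txt covers a-z, space and apostrophe.
ALLOWED = set("abcdefghijklmnopqrstuvwxyz' ")


def spell_number(match):
    """Replace one run of digits with its spelled-out form."""
    if num2words is None:
        raise RuntimeError("install num2words to convert digits")
    return " " + num2words(int(match.group())) + " "


def normalize_line(line, number_to_words=spell_number):
    """Lowercase, spell out integers, and drop characters outside ALLOWED."""
    # Map the typographic apostrophe to the plain one before filtering.
    line = line.lower().replace("’", "'")
    line = re.sub(r"\d+", number_to_words, line)
    # Anything outside the alphabet (punctuation, hyphens in spelled-out
    # numbers, ...) becomes a space so that words do not run together.
    line = "".join(ch if ch in ALLOWED else " " for ch in line)
    # Collapse repeated whitespace.
    return " ".join(line.split())
```

Each cleaned sentence would then go on its own line in the text file you build the language model from. Note that this sketch only handles whole integers: a decimal such as "3.5" is treated as two separate numbers, and dates or currencies would need their own rules.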
