How should a language model dataset be prepared?

Hi, I’m trying to build my own model, but I’m quite confused about how I should process the data for the language model.
Should I remove punctuation and numbers? Are there any tips and tricks for generating a language model?

Please search before asking, this is the second time today. Read this and come back if you have questions :slight_smile:

https://deepspeech.readthedocs.io/en/master/Scorer.html#building-your-own-scorer

I’m not sure you read the question. I’ve collected data, and it contains punctuation and numbers.
I’m asking whether I should remove the punctuation, and whether there is anything else I should take care of before creating the model.
Could you point me to the part of the documentation you sent that answers my question? :slight_smile:

If you use the standard alphabet.txt included for English, remove all punctuation from your input that is not in that alphabet.

Convert numbers to words with num2words.

There are numerous ways to include punctuation and stuff, but start simple and take it from there.
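
A minimal sketch of that simple starting point, in case it helps. It assumes the alphabet is a-z, space, and apostrophe, so check that against your own alphabet.txt. `normalize_line` and `ALLOWED` are made-up names for illustration, and the number converter is passed in so you can plug in num2words or anything else.

```python
import re
import string

# Assumption: the allowed characters are a-z, space and apostrophe.
# Adjust this set to match the alphabet.txt you actually train with.
ALLOWED = set(string.ascii_lowercase + " '")


def normalize_line(line, spell_number, allowed=ALLOWED):
    """Lowercase a line, spell out digit runs, drop characters outside `allowed`.

    `spell_number` is any callable that maps an int to its spoken form,
    for example `num2words` from the num2words package.
    """
    line = line.lower()
    # Replace each run of digits with words, padded so it cannot fuse with neighbours.
    line = re.sub(r"\d+", lambda m: " " + spell_number(int(m.group())) + " ", line)
    # Turn every character outside the alphabet into a space so words do not merge.
    line = "".join(ch if ch in allowed else " " for ch in line)
    # Collapse repeated whitespace and trim the ends.
    return " ".join(line.split())
```

With num2words installed (`pip install num2words`), you would call it as `normalize_line(text, num2words)` for each line before writing the text file used to build the scorer. Note that this treats "3.5" as two separate numbers and ignores dates, currencies, and ordinals, so handle those separately if your data has many of them.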
