What is the right format for building a language model for Chinese?

I am building a scorer for Chinese using all the transcripts from the Common Voice dataset. I do not speak Chinese myself; this is for research purposes. Thanks!