Hi, I am building a hybrid ASR system on the CommonVoice dataset. Since a pronunciation lexicon is absent from this dataset, I have to build one using a G2P model. Running a G2P model directly on the raw transcripts may fail for reasons such as punctuation, invalid characters, and inconsistent casing. I understand that recent E2E ASR systems treat punctuation as part of the output sequence, while a typical phone-based ASR system does not model punctuation, so I am wondering:
i) do you have any plans for text preprocessing of this dataset? and
ii) is there a recommended method or toolkit for preprocessing the transcripts?
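For illustration, the kind of preprocessing I have in mind is a simple normalization pass before G2P: Unicode normalization, lowercasing, mapping curly apostrophes to straight ones, and dropping anything outside a target alphabet. A minimal sketch (my own assumption of what such a pass might look like, not an official recipe; the alphabet here is just illustrative for English):

```python
import re
import unicodedata

def normalize_transcript(text: str,
                         alphabet: str = "abcdefghijklmnopqrstuvwxyz' ") -> str:
    """Normalize a raw transcript before feeding it to a G2P model.

    Illustrative steps:
      1. NFKC-normalize to fold variant Unicode code points.
      2. Map the curly apostrophe (U+2019) to a straight one.
      3. Lowercase.
      4. Replace characters outside the target alphabet with spaces.
      5. Collapse runs of whitespace.
    """
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\u2019", "'")
    text = text.lower()
    # Anything not in the allowed alphabet (punctuation, digits, symbols)
    # becomes a space rather than being silently glued to neighbors.
    text = "".join(ch if ch in alphabet else " " for ch in text)
    # Collapse repeated whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()

print(normalize_transcript("Hello, World! It's a TEST."))
# → hello world it's a test
```

Of course, for languages with non-Latin scripts the alphabet would need to be defined per language, which is part of why I am asking whether a standard recipe or toolkit exists.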