Text preprocessing for CommonVoice dataset

Hi, I am building a hybrid ASR system on the CommonVoice dataset. Since the dataset does not include a pronunciation lexicon, I have to build one using a G2P model. Running a G2P model directly on the raw transcripts can fail because of punctuation, invalid characters, and inconsistent upper/lower case. I understand that recent end-to-end ASR systems treat punctuation as part of the output sequence, while a typical phone-based ASR system does not model it, so I am wondering:

i) do you have any plans for text preprocessing of this dataset? and
ii) is there any method or toolkit we can use to preprocess the transcripts?

Thanks in advance!

Hello,
You could run this shell script inside the folder that contains the TSV files:

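for file in *.tsv; do
  # Column 3 (assumed to hold the transcript): lowercase it with GNU sed's \L,
  # replace punctuation with spaces, then squeeze repeated spaces.
  cut -f3 "$file" | sed -e "s/./\L&/g" -e 's/[[:punct:]]/ /g' -e 's/[ ]\+/ /g' > f3.txt
  # Keep the remaining columns unchanged.
  cut -f1-2 "$file" > f1-2.txt
  cut -f4- "$file" > f4-n.txt
  # Reassemble the columns in their original order.
  paste f1-2.txt f3.txt f4-n.txt > "${file%.tsv}-no-punct.tsv"
  rm f1-2.txt f3.txt f4-n.txt
done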
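
Since the goal in the question is a G2P lexicon, a natural follow-up to the script above is collecting the unique words from the cleaned transcripts. This is only a minimal sketch: the helper name word_list is hypothetical, and it assumes the first row of each TSV is a header and that the transcript sits in column 3, as the script above does.

```shell
# Hypothetical helper: print the sorted unique words found in column 3 of a TSV.
# Assumptions: the first row is a header, and column 3 holds the transcript.
word_list() {
  tail -n +2 "$1" |     # drop the header row
    cut -f3 |           # keep only the transcript column
    tr -s ' ' '\n' |    # one word per line, squeezing repeated separators
    grep -v '^$' |      # drop empty lines left by leading spaces
    LC_ALL=C sort -u    # unique words, in byte-wise order
}
```

For example, word_list train-no-punct.tsv > words.txt would write one word per line (both file names are placeholders), which can then be passed to the G2P model.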

Hi,

Thanks for the help! May I ask what sed -e "s/./\L&/g" is used for?

To lowercase all text. \L is a GNU sed extension that converts the matched text (&) to lowercase.
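
To see what the three sed expressions from the script do, they can be tried on a single made-up sentence. This sketch assumes GNU sed, since both \L and \+ are GNU extensions; the function name normalize is hypothetical.

```shell
# The three sed expressions from the script, applied to standard input.
# Assumes GNU sed: \L (lowercase) and \+ (one or more) are GNU extensions.
# 1) lowercase every character, 2) punctuation -> space, 3) squeeze spaces.
normalize() {
  sed -e 's/./\L&/g' \
      -e 's/[[:punct:]]/ /g' \
      -e 's/[ ]\+/ /g'
}

printf '%s\n' 'Hello, World!  How are you?' | normalize
```

With GNU sed this prints "hello world how are you" followed by one trailing space, because the final "?" becomes a space. Note that [[:punct:]] also matches the apostrophe, so "don't" becomes "don t"; depending on the language, that may need separate handling before G2P.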

Depending on which language you are working on, commonvoice-utils might be useful.

This has been successfully used to build over 30 models for a variety of languages.

Wow, thanks a lot for the toolkit!