Spoken text normalisation

What’s the approach to spoken text normalisation now that sentences are coming directly from Wikipedia and public domain books (which would have the text in written rather than spoken form)?

I believe that when Common Voice started, numbers were excluded or had to be written out (I might have this wrong, but that’s what I recall seeing). Is there a step in the import that normalises them, or do numeric and other normalisable things get excluded somehow? If there is something doing normalisation, how robust is it? (It seems quite a hard problem.)

If the sentences aren’t normalised then presumably they’ll challenge DeepSpeech, as what is actually spoken won’t correspond directly to the text. There’s also a chance that different people will normalise them differently (e.g. 15.45 could be read as "fifteen forty-five" or "quarter to four"). Numbers are the biggest category, but there are plenty of other cases (e.g. abbreviations, acronyms vs initialisms, etc.).
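
To make the concern concrete, here is a rough sketch of the kind of check I have in mind: flag any sentence whose spoken form is ambiguous so it can be excluded or reviewed. The function name and the patterns are my own assumptions for illustration, not anything Common Voice actually runs, and the abbreviation list is far from complete.

```python
import re
from typing import List

# Hypothetical patterns for illustration only, not Common Voice's actual rules.
DIGITS = re.compile(r"\d")                      # e.g. "15.45", "1984"
ACRONYM = re.compile(r"\b[A-Z]{2,}\b")          # e.g. "NASA", "BBC"
ABBREVIATION = re.compile(
    r"\b(?:[A-Za-z]\.){2,}"                     # dotted forms such as "e.g."
    r"|\b(?:Dr|Mr|Mrs|St|etc)\."                # a few common titles and shortenings
)


def ambiguous_when_spoken(sentence: str) -> List[str]:
    """Return the reasons a sentence could be read aloud in more than one way."""
    reasons = []
    if DIGITS.search(sentence):
        reasons.append("digits")
    if ACRONYM.search(sentence):
        reasons.append("acronym")
    if ABBREVIATION.search(sentence):
        reasons.append("abbreviation")
    return reasons


if __name__ == "__main__":
    for s in ["The train leaves at 15.45.", "NASA met Dr. Smith", "The cat sat on the mat."]:
        print(s, ambiguous_when_spoken(s))
```

A filter like this only detects the problem. Actually expanding "15.45" into words would need context (time, price, decimal), which is why excluding such sentences seems the safer option.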

The same rules we applied to sentence collection are being applied to sentence extraction. You can read them all here:


Thank you - I’d missed that

