Spoken text normalisation

nmstoker · August 29, 2019, 3:35pm

What’s the approach regarding spoken text normalisation now that sentences are coming directly from Wikipedia / public domain books (which would have the text in written rather than spoken form)?

I believe when Common Voice started, numbers were excluded or had to be written out (I might have this wrong, but that’s what I recall seeing). Is there a step in the import that normalises them, or do numeric and other normalisable things get excluded somehow? If there is something doing normalisation, how robust is it? (It seems quite a hard problem.)
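To make the "excluded somehow" option concrete, here is a minimal sketch of a filter that drops any sentence containing something that would need normalising. It is purely illustrative: the function names and the three regular expressions are my own assumptions, not the actual Common Voice import code, and a real rule set would need to be language-specific.

```python
import re

# Hypothetical patterns for things that are written differently from how they are spoken.
DIGITS = re.compile(r"\d")                          # e.g. 15.45, 2019
ACRONYM = re.compile(r"\b[A-Z]{2,}\b")              # e.g. NASA, BBC
DOTTED_ABBREVIATION = re.compile(r"\b(?:[A-Za-z]\.){2,}")  # e.g. U.S., e.g.


def needs_normalisation(sentence: str) -> bool:
    """Return True if the sentence contains digits, acronyms or dotted abbreviations."""
    return bool(
        DIGITS.search(sentence)
        or ACRONYM.search(sentence)
        or DOTTED_ABBREVIATION.search(sentence)
    )


def filter_sentences(sentences: list[str]) -> list[str]:
    """Keep only sentences that can be read aloud exactly as written."""
    return [s for s in sentences if not needs_normalisation(s)]


if __name__ == "__main__":
    candidates = [
        "The meeting is at 15.45.",
        "NASA launched a rocket.",
        "The cat sat on the mat.",
    ]
    print(filter_sentences(candidates))
```

Excluding is much simpler than normalising, at the cost of throwing away every sentence that contains a number or abbreviation.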

If the sentences aren’t normalised then presumably they’ll challenge DeepSpeech (as what is actually spoken won’t correspond directly to the text), and there’s a chance that people will normalise them differently when reading (e.g. 15.45 could be fifteen forty-five or quarter to four). Numbers are the biggest category but there are plenty of other cases (e.g. abbreviations, acronyms vs initialisms, etc.).
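As a rough illustration of the ambiguity, the sketch below lists several plausible spoken forms for a single written token such as 15.45. The helper names are hypothetical and the rules are deliberately minimal (English only, two-digit minutes, no "oh five" or "half past" handling); it is not a real normaliser.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = {20: "twenty", 30: "thirty", 40: "forty", 50: "fifty"}


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0-59."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    word = TENS[tens * 10]
    return word if ones == 0 else f"{word}-{ONES[ones]}"


def spoken_variants(token: str) -> list[str]:
    """Return candidate readings for a token of the form 'H.MM' or 'HH.MM'."""
    hours_text, minutes_text = token.split(".")
    hours, minutes = int(hours_text), int(minutes_text)

    # Reading 1: a 24-hour clock time read as two numbers.
    variants = [f"{number_to_words(hours)} {number_to_words(minutes)}"]

    # Reading 2: a colloquial 12-hour clock time (only the 45-minute case here).
    if minutes == 45:
        next_hour = (hours % 12) + 1
        variants.append(f"quarter to {number_to_words(next_hour)}")

    # Reading 3: a plain decimal number, digits after the point read one by one.
    decimals = " ".join(ONES[int(d)] for d in minutes_text)
    variants.append(f"{number_to_words(hours)} point {decimals}")

    return variants


if __name__ == "__main__":
    for reading in spoken_variants("15.45"):
        print(reading)
```

Without context there is no way to tell which reading a given speaker will choose, so the recorded audio and the written transcript can diverge.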

nukeador · August 29, 2019, 4:26pm

The same rules we applied to sentence collection are being applied to sentence extraction; you can read them all here:

https://common-voice.github.io/sentence-collector/#/how-to

nmstoker · August 30, 2019, 12:08pm

Thank you - I’d missed that
