I completely agree about the ONE (numeral) to MANY (words) issue.
But what is the issue? “He is born on 12/12/1212 to die on 10/10/1312” (9 words) will become “he is born on twelfth of December of one thousand and two hundred twelve to die on tenth of October of one thousand and three hundred twelve”. 27 words. …but it’s easy to split and/or to (humanly) correct if wrong. And again, it helps the human who is trying to input data, whether a power user or a peon.
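To make the ONE-to-MANY expansion concrete, here is a minimal sketch of how a numeric date could be spelled out in English. It is only an illustration under stated assumptions: the function names (`cardinal`, `ordinal`, `expand_dates`) are hypothetical and are not part of any Common Voice tool, dates are assumed to be day/month/year, and the wording of the output is just one possible choice.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
IRREGULAR = {"one": "first", "two": "second", "three": "third",
             "five": "fifth", "eight": "eighth", "nine": "ninth",
             "twelve": "twelfth"}


def cardinal(n: int) -> str:
    """Spell out an integer between 0 and 9999 (hypothetical helper)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + cardinal(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return cardinal(thousands) + " thousand" + (" " + cardinal(rest) if rest else "")


def ordinal(n: int) -> str:
    """Spell out a day number (1 to 31) as an ordinal."""
    head, sep, last = cardinal(n).rpartition("-")
    if last in IRREGULAR:
        last = IRREGULAR[last]
    elif last.endswith("y"):
        last = last[:-1] + "ieth"
    else:
        last += "th"
    return head + sep + last


DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def expand_dates(sentence: str) -> str:
    """Replace each day/month/year date with words; leave invalid ones as-is."""
    def repl(match: re.Match) -> str:
        day, month, year = (int(g) for g in match.groups())
        if not (1 <= day <= 31 and 1 <= month <= 12):
            return match.group(0)  # not a plausible date, let a human decide
        return f"the {ordinal(day)} of {MONTHS[month - 1]} {cardinal(year)}"
    return DATE.sub(repl, sentence)


print(expand_dates("He is born on 12/12/1212 to die on 10/10/1312"))
```

Anything the pattern cannot interpret is left untouched, so a human reviewer can still split or correct it afterwards.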
On the other hand, my first approach to the whole problem was actually this:
we have sentences in the Common Voice data that are (quite) garbage, coming from previous EXTRACTOR sentences (from Wikipedia)
that were rejected in the CorporaCreator (AFTER recording),
and I was thinking that Cleanup was working before validation (I was wrong! see this PR for DOCS),
and before direct upload from the extractor. (Yes, I went to COLLEctor, thinking it was the EXTRActor)
In other words, I’m working hard on cleanup to solve a problem that will NOT be cleaned up by cleanup.
Well.
I agree, I missed some steps at the beginning. But I’m learning while doing it.
…
Anyway, now that I’m involved, I’m trying to hit 4 birds with 1 stone:
- improve COLLECTOR,
- have a cleanup tool available for FUTURE collections from EXTRACTOR,
- have a tool to help clean OLD EXTRACTOR garbage,
- have a tool that eventually helps with bulk upload as well.
…not sure that it will fit them all.
But having cleanup routines before validation should help to solve “common issues” and, again, lower the barrier to entry for new contributors, right?
…The issue is not with “power users” like you, who have spreadsheets of what they recorded; the issue is for newcomers like me, who make a mess because they want to do well but don’t understand what (not) to do.
Shouldn’t we try to build a “one cleanup fits all” for everyone (collector, extractor, and bulk), or shall we build two/three separate files and cleanup routines, most of which will be common?!
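To illustrate what I mean by “one cleanup fits all”, here is a rough sketch of a single set of shared cleanup steps with a small per-source profile on top. Everything in it is an assumption for the sake of discussion: the step names, the `PROFILES` table, and the `clean` function are hypothetical and do not reflect how the collector, the extractor, or bulk upload actually work today.

```python
import re
import unicodedata


def normalize_unicode(sentence: str) -> str:
    """Use one canonical form for accented characters."""
    return unicodedata.normalize("NFC", sentence)


def collapse_whitespace(sentence: str) -> str:
    """Turn runs of spaces, tabs, or newlines into a single space."""
    return re.sub(r"\s+", " ", sentence).strip()


def strip_citation_markers(sentence: str) -> str:
    """Drop Wikipedia-style markers such as [3] (extractor-specific step)."""
    return re.sub(r"\[\d+\]", "", sentence)


# Steps shared by every source: the "common" part of the routine.
COMMON_STEPS = [normalize_unicode, collapse_whitespace]

# Each source only adds what is specific to it, before the common steps.
PROFILES = {
    "collector": COMMON_STEPS,
    "extractor": [strip_citation_markers] + COMMON_STEPS,
    "bulk": COMMON_STEPS,
}


def clean(sentence: str, source: str) -> str:
    """Run the cleanup steps registered for the given source, in order."""
    for step in PROFILES[source]:
        sentence = step(sentence)
    return sentence


print(clean("He was  born in 1212.[3] ", "extractor"))
print(clean("  Hello   world ", "collector"))
```

With this layout the common routines live in one place, and the two/three separate variants shrink to a short list of extra steps per source.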