Sentence Collector - Cleanup before export vs. cleanup on upload

HelloTheWorld · September 16, 2022, 1:29pm

I completely agree about the ONE (numeral) to MANY (words) issue.

But what is the issue? “He is born on 12/12/1212 to die on 10/10/1312” (9 words) will become “he is born on twelfth of december of one thousand and two hundred twelve to die on tenth of october of one thousand and three hundred twelve” (27 words). …But it’s easy to split and/or to correct by hand if it is wrong. And again, it helps the human who is trying to input data, whether a power user or a peon.
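To make the 9-to-27-word expansion concrete, here is a minimal sketch in Python. It is not the Sentence Collector's actual code: the function names (`expand_dates`, `year_words`, `ordinal`) are hypothetical, the date is assumed to be day/month/year, and the wording deliberately mirrors the example above rather than any official normalization rule.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]
IRREGULAR = {"one": "first", "two": "second", "three": "third", "five": "fifth",
             "eight": "eighth", "nine": "ninth", "twelve": "twelfth"}

def under_100(n: int) -> str:
    """Cardinal words for 0..99."""
    if n < 20:
        return ONES[n]
    return TENS[n // 10] + ("" if n % 10 == 0 else " " + ONES[n % 10])

def ordinal(n: int) -> str:
    """Ordinal words for 1..99, built from the cardinal form."""
    words = under_100(n).split()
    last = words[-1]
    if last in IRREGULAR:
        last = IRREGULAR[last]
    elif last.endswith("y"):
        last = last[:-1] + "ieth"
    else:
        last += "th"
    return " ".join(words[:-1] + [last])

def year_words(year: int) -> str:
    """Words for 1000..9999, phrased like the example in the post."""
    thousands, rest = divmod(year, 1000)
    hundreds, tail = divmod(rest, 100)
    parts = [f"{ONES[thousands]} thousand"]
    if hundreds:
        parts.append(f"and {ONES[hundreds]} hundred")
    if tail:
        parts.append(under_100(tail))
    return " ".join(parts)

DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")

def expand_dates(sentence: str) -> str:
    """Replace day/month/year numerals with words; leave invalid dates alone."""
    def repl(match: re.Match) -> str:
        day, month, year = (int(g) for g in match.groups())
        if not (1 <= day <= 31 and 1 <= month <= 12 and year >= 1000):
            return match.group(0)  # not a plausible date: keep for human review
        return f"{ordinal(day)} of {MONTHS[month - 1]} of {year_words(year)}"
    return DATE.sub(repl, sentence)

expanded = expand_dates("He is born on 12/12/1212 to die on 10/10/1312")
print(len(expanded.split()), expanded)
```

Anything the pattern does not recognize as a plausible date is left untouched, so a human can still split or correct the sentence afterwards.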

On the other hand, my first approach to the whole problem was actually this:
we have sentences in the Common Voice data that are (quite) garbage, coming from earlier Sentence EXTRACTOR runs (on Wikipedia),
that were rejected in the CorporaCreator (AFTER recording),
and I was thinking that Cleanup ran before validation (I was wrong! see this PR for DOCS),
and before direct upload from the extractor. (Yes, I went to the COLLEctor, thinking it was the EXTRActor.)

In other words, I’m working hard on cleanup to solve a problem that will NOT be cleaned up by cleanup.
Well.
I agree, I missed some steps at the beginning. But I’m learning while doing it.

…

Anyway, now that I’m involved, I’m trying to hit 4 birds with 1 stone:

improve the COLLECTOR,
have a cleanup tool available for FUTURE collections from the EXTRACTOR,
have a tool to help clean OLD EXTRACTOR garbage,
have a tool to eventually help with bulk upload as well.

…not sure that it will fit all.

But having cleanup routines before validation should help to solve “common issues” and, again, lower the barrier to entry for new contributors. Right?

…The issue is not with “power users” like you, who have spreadsheets of what they recorded; the issue is with newcomers like me, who make a mess because they want to do well but don’t understand what (not) to do.

Shouldn’t we try to build a “one cleanup fits all” for everyone (collector, extractor, and bulk)? Or shall we build two or three separate files and cleanup routines, most of which will be common?!
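One way to picture the “one cleanup fits all” option is a shared list of common rules plus a few source-specific ones. This is only a sketch under assumptions: the rule names (`collapse_whitespace`, `normalize_quotes`, `strip_wiki_refs`) and the `cleanup` function are hypothetical illustrations, not the existing Sentence Collector or Sentence Extractor code.

```python
import re
from typing import Callable, Iterable, List

# A cleanup rule takes one sentence and returns the cleaned sentence.
Rule = Callable[[str], str]

def normalize_quotes(sentence: str) -> str:
    """Replace curly double quotes with straight ones."""
    return sentence.replace("\u201c", '"').replace("\u201d", '"')

def collapse_whitespace(sentence: str) -> str:
    """Squeeze runs of whitespace and trim both ends."""
    return re.sub(r"\s+", " ", sentence).strip()

def strip_wiki_refs(sentence: str) -> str:
    """Extractor-only example rule: drop footnote markers such as [12]."""
    return re.sub(r"\[\d+\]", "", sentence)

# Rules shared by every entry point (collector, extractor, bulk upload).
COMMON_RULES: List[Rule] = [normalize_quotes, collapse_whitespace]

def cleanup(sentences: Iterable[str], extra_rules: Iterable[Rule] = ()) -> List[str]:
    """Apply source-specific rules first, then the common ones."""
    rules = [*extra_rules, *COMMON_RULES]
    cleaned = []
    for sentence in sentences:
        for rule in rules:
            sentence = rule(sentence)
        if sentence:  # drop sentences that end up empty
            cleaned.append(sentence)
    return cleaned

# Collector and bulk upload would use the common rules only;
# the extractor adds its own rule on top.
print(cleanup(["  Hello   world  ", "   "]))
print(cleanup(["Paris[12] is  big."], extra_rules=[strip_wiki_refs]))
```

With this shape, the common part lives in one place, and each tool only declares what is specific to it.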
