Sentence Collector - Cleanup before export vs. cleanup on upload

Hi everyone,

@wannaphong has brought up a topic on GitHub I’d like to discuss here. Namely, the sentences in the sentence-collector.txt file in the Common Voice data folder do not necessarily match the sentences in the Sentence Collector DB and therefore are not findable through the Sentence Collector API.

This happens in the following scenario:

  • Somebody uploads a sentence
  • That sentence passes validation
  • The language has cleanup transformations specified, which get applied when exporting to that folder
  • The sentence in the sentence-collector.txt file is the cleaned-up sentence, while the sentence in the Sentence Collector DB is still the “old” one
  • Therefore the API logic does not find the sentence and returns nothing

This can currently happen with English, French and Thai; other languages do not have any cleanups specified. These cleanups mostly exist because validations were missing at the beginning, and they were extended when the validations were adjusted, to catch already existing sentences that should be fixed.
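To make the mismatch concrete, here is a minimal TypeScript sketch (the function name and rule are invented for illustration; the real per-language rules live in the repository):

```ts
// A minimal illustration of the current behaviour (invented names, not the
// actual Sentence Collector code):
function exportCleanup(sentence: string): string {
  // e.g. an English-style cleanup rule: drop quotes fully enclosing a sentence
  return sentence.replace(/^["“](.+)["”]$/, '$1');
}

const storedInDb = '“This is a quoted sentence.”'; // what the SC database keeps
const writtenToFile = exportCleanup(storedInDb);   // what sentence-collector.txt gets

// An exact-match lookup of the exported text against the DB finds nothing:
console.log(writtenToFile === storedInDb); // false
```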

I have generally been of the opinion that we should avoid the cleanup scripts as much as possible. However, thinking about this problem has changed my mind. I think the cleanup can be a good way of letting contributors upload sentences and then doing corrections on them as needed. This means fewer “errors” for contributors. For certain things, we can accept the sentence and apply a transformation to it, rather than simply rejecting it and letting the contributor figure out what needs to be fixed.

Therefore, I’m suggesting the following:

  • Instead of running the cleanup while exporting, let’s run it after validating sentences and before writing them to the Sentence Collector database (see the sketch after this list)
  • This would mean that there is no discrepancy between the sentence-collector.txt file and the database, and the API would also work correctly
  • For all the existing sentences we would need to run a migration that applies all the cleanups to the sentences in the database before moving the cleanup step within the process.
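A minimal sketch of that reordered flow, with placeholder names (this is not the actual SC code):

```ts
// Sketch of the proposed order; all names are placeholders.
type SentenceRow = { locale: string; sentence: string };

declare function passesAutomaticValidation(s: string, locale: string): boolean;
declare function applyCleanup(s: string, locale: string): string;
declare const db: { insertSentence(row: SentenceRow): Promise<void> };

async function submitSentence(raw: string, locale: string): Promise<void> {
  if (!passesAutomaticValidation(raw, locale)) {
    throw new Error('Sentence rejected by automatic validation');
  }
  // Cleanup moved here, before the DB write, instead of at export time:
  const cleaned = applyCleanup(raw, locale);
  await db.insertSentence({ locale, sentence: cleaned });
  // The export then writes exactly what is in the DB, so
  // sentence-collector.txt and the API always agree.
}
```

The one-off migration mentioned above would then just run the same cleanup over every existing database row.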

There might be something I’m missing, so I’d love to hear your input on this. What do you think of that? Can you imagine something that would break if we did this? Do you see any downsides?

Thanks!
Michael

3 Likes

I think it is best that all database entries reflect the final state, and I don’t see anything missing here - after the migration, that is. As the recordings are done using the final format, the voice data should also be OK.

A small thing though: during entry on SC, the data is checked against the existing DB. Somebody can enter the same erroneous sentence and it will not be recognized until after validation. Also, validators can reject them if they think it is not a correct sentence.

But I don’t understand why it is a good thing to allow them in the first place. I think the contributors should not blindly copy-paste stuff and let the validators figure it out. Sometimes the validators also don’t see the mistakes. Three eyes are better than two…

I’m saying this without knowing what those clean-up routines are cleaning, so I may be wrong on that :confused: Never looked at that code…

I always clean up stuff before laying eyes on it (e.g. pre-processing scanned books, where I correct stuff like “. . .” to “…” or “!.” to “!”, correct common spelling mistakes, convert numbers to text, etc.).
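For illustration, rules of that kind could look like this (a sketch, not the actual pre-processing scripts):

```ts
// A few cleanup rules of the kind described above (illustrative only):
const rules: Array<[RegExp, string]> = [
  [/\s*\.\s*\.\s*\./g, '…'], // ". . ." or "..." -> a single ellipsis character
  [/!\./g, '!'],             // "!." -> "!"
  [/\s{2,}/g, ' '],          // collapse whitespace runs from OCR
];

const fix = (line: string): string =>
  rules.reduce((s, [pattern, replacement]) => s.replace(pattern, replacement), line).trim();

console.log(fix('Wait . . . what!.')); // "Wait… what!"
```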

Looked at the code (you already gave the link :slight_smile: ). Similar stuff. No, I don’t think there is a problem.

But I think it is better to do the cleanup after entry, but before validation, so at least two people check it.

1 Like

Thanks for the input!

I absolutely agree!
I could have been clearer on this one. When I said “after validation”, I meant the validation that happens directly when uploading sentences: https://github.com/common-voice/sentence-collector/blob/main/server/lib/validation/VALIDATION.md, not the manual “review” by the other contributors.
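For readers following along, the upload-time checks are roughly of this kind (a simplified sketch; the real rules are per-language and live in the linked VALIDATION.md, and the 125-character bound is assumed from the general Common Voice sentence limit):

```ts
// A simplified sketch of the kind of checks that run directly at upload time:
function passesAutomaticValidation(sentence: string): boolean {
  if (sentence.trim().length === 0) return false; // empty input
  if (sentence.length > 125) return false;        // assumed Common Voice length limit
  if (/\d/.test(sentence)) return false;          // no numerals
  if (/[A-Z]{2,}/.test(sentence)) return false;   // no ACRONYMS / abbreviations
  return true;
}

// This runs immediately on upload; the later human "review" is separate.
```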

I think we’re on the same page then?

1 Like

Hi guys & girls,

Sorry to interrupt, just to discuss this issue that could help ME understand what you’re talking about :sweat_smile:

Many thanks, you can resume your discussion here.

This is potentially nice, but I think that both should be saved: the updated sentence (which would be the one uploaded to Common Voice) and the original sentence (which could be used for identifying sentences based on their original form).

My reasoning is that many people pre-validate sentences in e.g. Google Sheets or something like that before uploading them, and may also add metadata there, for example variant or domain. This would break the workflow of those people (like me); I realise that workflow is not ideal, but it is necessary to get around current limitations.

2 Likes

Yes, I also do pre-validate them, also pre-checking against the existing data from the latest text corpus to prevent duplicates beforehand.

But this is not enough, as simple punctuation changes can cause duplicates. Therefore I normalize both parts and compare those as well (this only gives a warning; I must hand-validate pairs like “Tea.” and “Tea?”).
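As a sketch, the normalize-and-compare idea might look like this (invented helper names):

```ts
// Sketch of the normalize-and-compare duplicate check described above:
function normalize(sentence: string): string {
  return sentence
    .toLocaleLowerCase('tr')            // Turkish-aware lowercasing (İ/ı)
    .normalize('NFC')
    .replace(/[.,;:!?"'“”‘’…()-]/g, '') // strip punctuation
    .replace(/\s+/g, ' ')
    .trim();
}

function isLikelyDuplicate(candidate: string, existing: Set<string>): boolean {
  return existing.has(normalize(candidate));
}

// “Tea.” and “Tea?” normalize to the same string, so this can only be a
// warning that triggers a manual check:
console.log(normalize('Tea.') === normalize('Tea?')); // true
```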

Lately I also added “similarity” checks and optional dictionary checks (to see if a sentence adds enough new vocabulary) to my workflow to further limit them.

There is no cleanup process for Turkish in SC; I already do that while extracting. But if there were one (or in the case of other people’s submissions), such sentences would not be recognized as duplicates in SC if somebody else tried to use the same sources I’ve been using.

PS: I just recognized I’ve been overdoing it :rofl:

Hey @mkohler, it seems that somebody already did some automatic cleaning on sentences.

I found these while trying to figure out why we still keep getting an “other.tsv”, although we cleaned the waiting list each time.

(Excerpts from other.tsv in v9.0 and v10.0 followed here.)

Please be aware that some sentences fully enclosed in quotes got corrected. All these stuck ones are from older times (from SETimes), btw… I had these on file and was checking against them while adding new sentences. This corresponds to what @ftyers was saying… I think I need fresh files…

(No overheating on my machine yet…)

Hello everyone again.

After reading everything (more than a dozen times, including the links and code and so on)…

Here is my recap:

As Michael @mkohler said, CURRENTLY the steps are:

  • gather a bunch of sentences to send to the Common Voice DB for recording (…it’s raw material on your computer)
  • (optional) manually pre-clean, as Bülent @bozden does. More or less, that means doing some of the stuff done by cleanup, like changing “ . . . ” to “…”, but BEFORE everything starts on the Common Voice servers.
  • then send it to Sentence Collector (here?)
  • then the VALIDATION step runs, rejecting everything flagged (numerals, ACRONYMS, etc.)
  • If NOT rejected, it’s recorded in the “INPUT stuff”, aka the Sentence Collector DB (…link?) (if rejected, the page shows an error message like the one below)
  • Then the EXPORT ROUTINE runs and creates the sentence-collector.txt… But, warning, this sentence-collector.txt file contains the “cleaned-up stuff”, and no longer the “original stuff” that is still in the Sentence Collector DB
  • After that, the sentence-collector.txt is sent to Common Voice (via some magic I don’t understand yet. Don’t worry, it’s magic! :mage:)

I tried to trigger the validation process… And it worked :smile_cat:. Here is the error message shown for a French import of numerals (1 2 0…)

The new proposal I read above (and propose myself) is the following:
MODIFICATIONS from the previous workflow are in ITALIC (a rough sketch of this flow follows the list)

  • gather a bunch of sentences to send to the Common Voice DB for recording (…it’s raw material on your computer)
  • (optional) manually pre-clean, as Bülent @bozden does. More or less, that means doing some of the stuff done by cleanup, like changing “ . . . ” to “…”, but BEFORE everything starts on the Common Voice servers.
  • then send it to Sentence Collector (here?)
  • then process it with the CLEAN-UP routine, and record both the raw and the cleaned-up text as ‘awaiting validation’ text.
  • then the VALIDATION step runs on the cleaned-up text (great idea!), rejecting everything flagged (numerals, ACRONYMS, etc.) EXCEPT IF they have been changed to TEXT numerals or fully spelled-out acronyms, so that they no longer trigger the VALIDATION barriers (…like I’m proposing in the French cleanup files in pull 635)
  • If NOT rejected, it’s recorded in the “INPUT stuff”, aka the Sentence Collector DB (…link?), with BOTH the RAW and the CLEANED-UP version, for ‘duplicate controls’ and ‘no re-upload controls’ as requested by Francis @ftyers
  • (Michael @mkohler was proposing to run the clean-up at THIS step… AFTER validation. In this proposal, I put it BEFORE validation)
  • Then the EXPORT ROUTINE runs and creates the sentence-collector.txt… (with cleaned-up stuff only?)
  • After that, the sentence-collector.txt is sent to Common Voice (same magic as before, but it’s a woman this time :woman_mage:)
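Here is the promised rough sketch of that flow, with placeholder names (not the actual SC code): cleanup runs first, validation sees the cleaned text, and both versions are stored.

```ts
type PendingSentence = { locale: string; raw: string; cleaned: string };

declare function applyCleanup(s: string, locale: string): string;
declare function passesAutomaticValidation(s: string, locale: string): boolean;
declare const db: { insertPending(row: PendingSentence): Promise<void> };

async function submitSentence(raw: string, locale: string): Promise<void> {
  const cleaned = applyCleanup(raw, locale);         // CLEAN-UP first
  if (!passesAutomaticValidation(cleaned, locale)) { // then VALIDATION
    throw new Error('Rejected even after cleanup');
  }
  // Keep both versions: raw for duplicate / re-upload checks,
  // cleaned for export to sentence-collector.txt.
  await db.insertPending({ locale, raw, cleaned });
}
```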

Please feel free to correct me!

@ Michael @mkohler, I totally agree that it’s a game changer in clean-up handling that may be more than the team can chew. But I think it’s for the better. (And it’s definitely out of my league, so I’m just proposing; I won’t be able to implement it myself.)

@HelloTheWorld Yes, I think you got most parts right. There are some missing steps though. I have created this Pull Request to update the README in the repository if you could have a look at it :slight_smile:

This is contrary to my initial suggestion in the post; however, I think this does indeed bring up a good point. With my suggestion the cleanup could not be used to correct anything that would not pass the automatic validation. That’s why I commented on your PR that parts of your cleanup proposals are not needed, as they would not even pass validation (and therefore would never get to the cleanup step).

I have to say, I kinda like the approach of running the cleanup before the validation, so that for example numbers could be converted from “2” to “two”, etc. Of course, that might not be the best example (or easy to do) for every language. Overall we wouldn’t lose anything by switching this around. @bozden @ftyers do you have any preferences on which way around this is done?

Thanks for bringing up this point. I was not aware of this. I definitely would agree that we should support these use cases. I just don’t know exactly if saving both variants is the way to go here and what implications this would have. Let me think about this for a bit.

In many languages, numbers will have very long text counterparts; dates and apostrophes are especially problematic. I hit these in my conversion routines:

  • 1984 => Bin dokuz yüz seksen dört (adds 4 words and/or a lot of sentence length; will fail initial validation)
  • 1984’de (in 1984) => Bin dokuz yüz seksen dörtte (the ’ is dropped and d changes to t)
  • 2. sırada ben varım (I’m the second in line) => İkinci sırada ben varım
  • 21/01/1984 => ?
  • -10 - +10 C arasında (between minus ten and plus ten degrees)
  • Formatting differences: 1.234.567,89 or 1,234,567.89 or %1,20 or %1.20 (in Turkish texts, data may come from English and use the other format, etc.)

So, they are not simple, not even in the easiest case. You tokenize the line and try to make sense of those numbers. After a lot of language-specific code you convert most numbers, but then you have to re-scan the line to correct the edge cases. THEN you post to Sentence Collector.
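To illustrate why, here is a deliberately minimal sketch of the tokenize-and-convert step (English number words for brevity; a real Turkish converter needs far more rules, which is exactly the point):

```ts
const ONES = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'];

function numberToWords(token: string): string | null {
  // Only bare single digits are handled here; dates, suffixed forms like
  // "1984'de", decimals and "%1,20"-style tokens are exactly the edge
  // cases listed above.
  return /^[0-9]$/.test(token) ? ONES[Number(token)] : null;
}

function convertLine(line: string): string {
  return line
    .split(/\s+/)
    .map(token => numberToWords(token) ?? token)
    .join(' ');
}

console.log(convertLine('I have 2 cats'));  // "I have two cats"
console.log(convertLine("1984'de geldi"));  // unchanged: left for the re-scan pass
```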

I would never do it in SC. IMHO, people should enter edited/corrected sentences into SC, also correcting possible old wording to its current counterparts.

1 Like

It was just the first example that came to mind, as I said, not necessarily a good one :grin: What’s your opinion on switching the order of validation and cleanup in general? Even if it’s currently not enabled for Turkish, does that sound reasonable?

1 Like

Yes I’m aware :slight_smile:

I wanted to emphasize that things can go wrong in many ways (e.g. some rules can mess up opening/closing quotes, some sentences have both of them and/or mix them with apostrophes, removing some spaces can mess up tokenizers, etc.), and my last paragraph is important… While I was writing the preprocessor, I had a look at those files; the French one especially is detailed, but I had to exclude some of the rules for the sake of other steps and applied some of them after the whole thing.

Anyway, the correct order for me, if this is implemented in a language:

Human pre-editing / conversions => Add => Cleaning => SW validation => 2 * Human Validation => …

+ Keeping both versions and providing both on the API, and/or an API endpoint “get_auto_cleaned_sentences”
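For example, if both versions were kept, the API could return something like this (hypothetical shape; "get_auto_cleaned_sentences" is just the name suggested above, not an existing endpoint):

```ts
interface CollectedSentence {
  id: number;
  locale: string;
  original: string; // as uploaded: usable for duplicate / re-upload checks
  cleaned: string;  // as exported to sentence-collector.txt
}

// e.g. GET /sentences/{locale}/auto-cleaned  ->  CollectedSentence[]
```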

Hope this helps…

I completely agree about the ONE (numeral) to MANY (words) issue.

But what’s the issue? “He is born on 12/12/1212 to die on 10/10/1312” (9 words) will become “he is born on twelfth of december of one thousand and two hundred twelve to die on tenth of october of one thousand and three hundred twelve”. 27 words. …but it’s easy to split and/or to (humanly) correct if wrong. And again, it helps the human who is trying to input data, whether they are a :woman_mage: power user or an :elf: peon.

On the other hand, my first approach to the whole problem was this, actually:
we have sentences in the Common Voice data that are (quite) garbage due to earlier EXTRACTOR runs (from Wikipedia)
that were rejected in the CorporaCreator (AFTER recording),
and I was thinking that Cleanup ran before validation (I was wrong! see this PR for the DOCS),
and before the direct upload from the extractor. (Yes, I went to the COLLEctor, thinking it was the EXTRActor.)

In other words, I’m working hard on cleanup to solve a problem that will NOT be cleaned up by cleanup.
Well.
I agree, I missed some steps at the beginning. But I’m learning while doing it :sweat_smile:

Anyway, now that I’m involved, I’m trying to kill 4 birds with 1 stone:

  • improve the COLLECTOR,
  • have a cleanup tool available for FUTURE collections from the EXTRACTOR,
  • have a tool to help clean up OLD EXTRACTOR garbage,
  • and a tool to eventually help with bulk uploads as well.

…not sure that one tool will fit all.

But having cleanup routines before validation should help to solve “common issues” and, again, lower the barrier to entry for new contributors. Right?

…The issue is not with “:mage: power users” like you, who have spreadsheets of what they recorded; the issue is with newcomers like me, who make a mess because they want to do well but don’t understand what (not) to do :smile_cat:

Shouldn’t we try to build a “one cleanup fits all” for everyone (collector, extractor, and bulk), or shall we build two or three separate files and cleanup routines, with most of their content in common?!

This already exists: https://github.com/common-voice/cv-sentence-extractor#using-language-rules. The problem here is that back in 2019, when the extraction was run, many of these rule possibilities did not exist, and the French rule file is very minimal. Of course this can be fixed now, in case an extraction of articles created since then is ever run.

In a perfect world… :wink:

So, back to the topic.

In different places, you say that “validation will not let through numbers” or things like that.

As we discussed and as I proposed before, I think we should have one “Regular Expression file that fits all”. It’s (IMHO) not logical to have multiple rule files; it is (still IMHO) a waste of energy, as different projects and people are trying to do the same thing: have something clean.

So, here is my point:

== current situation ==

We have a bunch of inputs (sentence collector, sentence extractor, direct PRs) with uncoordinated rules.

Validation in the sentence collector happens before cleanup (and before recordings), and we have a second “last-minute, last-resort” cleanup with rules in the CorporaCreator (after recordings).

The cleanup rules before and after are roughly the same, but they are managed in different repositories and are not even connected.

== a vision ==

There could be one big Regular Expression file for all cleanup, which could be addressed by the sentence collector, the sentence extractor, or the CorporaCreator.
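To illustrate the vision (this format is pure invention; as discussed later in the thread, no shared format has been decided):

```ts
interface CleanupRule {
  description: string;
  pattern: string;     // regular expression source
  replacement: string;
  appliesTo: Array<'sentence-collector' | 'sentence-extractor' | 'corpora-creator'>;
}

const sharedRules: CleanupRule[] = [
  {
    description: 'Collapse spaced dots into a single ellipsis',
    pattern: '\\s*\\.\\s*\\.\\s*\\.',
    replacement: '…',
    appliesTo: ['sentence-collector', 'sentence-extractor', 'corpora-creator'],
  },
];

// Each tool loads the same file and keeps only the rules tagged for it:
const extractorRules = sharedRules.filter(r =>
  r.appliesTo.includes('sentence-extractor')
);
```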

== what happens in the transition time ==
If people don’t use it for anything other than the Collector… well, it will still be used by the Collector :laughing:

If people do use it, it has to be as complete as possible.

For things that are CURRENTLY removed by validation, BUT WERE NOT (in a previous PR), or WILL NOT BE (if we run cleanup BEFORE validation), the cleanup RegEx are, in either case:

  • Not triggered. So nobody cares, and they remain available for later,
  • Triggered. And then they do the job.

== Wait! What are the drawbacks? ==

Actually, I don’t think there are really any.

  • There is more code to read and maintain! Yes, but everything would be in one place.
  • The code is not useful! True, not today, in the context of the Sentence Collector… But it could be used for the CorporaCreator (see the never-finished PR below).
  • Nobody will follow or work with a single RegEx file! Maybe. But having a usable ‘starter file’ would lower the barrier to entry for those starting out. And one can always hope that the old-timers, seeing the work already done, will complete the file for the community!

== And aren’t you solving a non-existing problem? ==

Am I falling into a base rate fallacy, trying to solve a problem that nobody has?
…Well, I don’t think so. Have a look at this PR, for example: it was never finished, but we are on the same page. …And it was for the CorporaCreator, AFTER recording :slightly_frowning_face:
See discussions and PRs like this, this, or this, or that French one to see that we are cleaning up old data, and that we could avoid this in the future. The best garbage removal is the garbage you don’t let in.

So, in conclusion:

TL;DR: I think that we could leave all the RegEx that are currently unused by the Sentence Collector in the cleanup files, for future usage and/or other cleanup usages (extractor, bulk upload, CorporaCreator). All this in a “see the bigger picture and build future-proof” bold move.

:+1: or :-1: ?

I still agree that this is a good idea, but it will require some exploration of how we can achieve it. I will create a new thread specifically for that. Let’s focus on the original topic here, as that is still important no matter whether we have a common rules description or not.

Edit: done here: Common rule files for Sentence Collector / Sentence Extractor

I disagree. If we keep it in your PR as it is now, there is a chance other contributors will look at it, figure they need to do the same and invest a lot of (for now) unnecessary time and it can also lead to confusion. Now your argument might be “that will also be helpful in the future”. Maybe, but as long as we do not know the data format and how exactly it will be applied, this is just a guess.

That being said, I do not want to lose work you’ve already done either. Therefore, if you add a comment on the top of the file that commented-out lines are currently not necessary but might be useful for the future and comment out all the lines I mentioned in the PR, then I’d be fine with merging this PR.
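For illustration, such a header could look like this (hypothetical layout; the real cleanup files may be structured differently):

```ts
// NOTE: The commented-out rules below are currently NOT necessary: the
// sentences they target are already rejected by the automatic validation
// and never reach the cleanup step. They are kept for possible future use,
// e.g. if cleanup is ever moved before validation.

const rules: Array<[RegExp, string]> = [
  [/\s{2,}/g, ' '], // active: collapse double spaces
  // [/.../g, '...'], // inactive: kept for the future, see note above
];
```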

Do you agree with this approach?

While I was reading

I disagree. If we keep it in your PR as it is now, there is a chance other contributors will look at it, figure they need to do the same and invest a lot of (for now) unnecessary time and it can also lead to confusion. Now your argument might be “that will also be helpful in the future”. Maybe, but as long as we do not know the data format and how exactly it will be applied, this is just a guess.

I was thinking about how to keep the idea around for future me.

And then I saw you were thinking the same:

That being said, I do not want to lose work you’ve already done either. Therefore, if you add a comment on the top of the file that commented-out lines are currently not necessary but might be useful for the future and comment out all the lines I mentioned in the PR, then I’d be fine with merging this PR.

So, we are on the same page :partying_face:

I’ll comment out the things that are not active now, and I’ll submit it again.

Thanks Michael !