Question about project plans

bozden · September 16, 2023, 7:31am

I know you are moving the backbone and that it will take a while; we all wish you good luck with that.

After the move, I think we all want to know about the immediate project plans, at least a rough timeframe, especially for the text-corpus-related workflow-breaking obstacles:

Multi-sentence entry form
Inclusion of bulk sentences through PR
Inclusion of exports from Wiki sources
Exports of text-corpora from the database to files
Basic statistics of how many sentences are waiting for validation

We’ve been working on new text-corpora for months now, and we don’t know when we will be able to record them or analyze the whole text-corpus, so we cannot tell where we stand or plan further.

bozden · September 16, 2023, 10:54am

Wow, I missed this PR from today (resolving #2), thank you, folks! This helps a lot…

bozden · April 15, 2024, 9:33am

Because somebody bumped this, let me give some feedback on the current status:

Multi-sentence entry form

Inclusion of bulk sentences through PR

There is no multi-sentence entry form, but we now have bulk submissions (although a rather lengthy process).

On the other hand, there are some related issues posted on GitHub. Because the sentences are not cleaned enough during entry (they can contain invisible characters such as LF, CRLF, and TAB), there are disruptions in the text-corpora recently released with v17.0. But I know it is currently being worked on (just received the message, thanks team!).
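To illustrate the kind of cleaning meant here, a minimal Python sketch follows. It is not the project's actual code: the function name `clean_sentence` is made up for illustration, and it assumes that a line break or tab inside a sentence should simply become a single space.

```python
import re

# Assumption: LF, CR and TAB inside a sentence are replaced by one space.
_INVISIBLE = re.compile(r"[\t\r\n]+")
# Runs of two or more spaces left behind are collapsed to one.
_SPACES = re.compile(r" {2,}")


def clean_sentence(text: str) -> str:
    """Return the sentence on one line, without tabs or line breaks."""
    text = _INVISIBLE.sub(" ", text)
    text = _SPACES.sub(" ", text)
    return text.strip()
```

A sentence cleaned this way can no longer split a row or shift a column when it is later written to a tab-separated file.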

Inclusion of exports from Wiki sources

AFAIK, the cv-sentence-extractor has not yet been changed to handle these, but it shouldn’t be hard to code… I don’t know how the workflow will change regarding this, but I can give a hand if you like @mkohler (cc: @jesslynnrose & @gina).

Exports of text-corpora from the database to files

Basic statistics of how many sentences are waiting for validation

I was thinking of exporting them to GitHub under server/data/<lc> at that time, but the project moved away from the idea as it is not scalable. Instead, with v17.0 the text-corpora are part of the dataset distributions, which is perfect (thank you team!). On the other hand, we only get to see them every three months, not just-in-time.

This enables us to examine the text-corpora statistically. As mentioned above, there are some issues and we cannot get the full information, but this is the first iteration, so we hope the issues will be fixed in the next version.

One of the issues also prevents getting statistics on “sentences waiting validation”, as “invalidated” and “not yet validated” ones are in the same bucket, without any clue as to which is which.

To check if there are sentences waiting to be validated, you just need to check the review page, just like you do for recordings through the validate page.

You can check your text-corpus statistics in the Analyzer (text-corpus tab), as far as the current data allows.

mkohler · April 15, 2024, 7:26pm

Do you mean the change to output a TSV file that can be uploaded to the bulk submission tool? Then yes, that indeed has not yet been done. For now it’s just a txt file with one sentence per line. It’s basically just whatever gets logged to the console. It probably would be fine to just adjust that log to be a TSV, but it might be more flexible to only output TSV when a special flag is passed. Then we could pass that flag only in the GitHub Actions, and for anything else the output would not change. The source itself could easily be auto-generated from the actual source (“Wikipedia”, …) and the date, for example.
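As a rough illustration of that idea, here is a small Python sketch (the extractor itself is not written in Python, so this only shows the logic). Everything named here is an assumption: the `render` function, the `as_tsv` switch standing in for the special flag, and the two column names `sentence` and `source`, which are placeholders and not the bulk submission tool's actual header.

```python
from datetime import date


def render(lines, as_tsv=False, source_name="Wikipedia", extracted_on=None):
    """Render extracted sentences as plain text (default) or as TSV.

    Hypothetical sketch: `as_tsv` plays the role of the special flag, and
    the TSV columns are placeholders for whatever the upload tool expects.
    """
    # Collapse all whitespace so no tab or line break survives in a sentence.
    sentences = [" ".join(line.split()) for line in lines]
    sentences = [s for s in sentences if s]

    if not as_tsv:
        # Unchanged default: one sentence per line.
        return "".join(s + "\n" for s in sentences)

    # Auto-generate the source from its name and the extraction date.
    extracted_on = extracted_on or date.today()
    source = f"{source_name} ({extracted_on.isoformat()})"
    rows = ["sentence\tsource"]
    rows.extend(f"{s}\t{source}" for s in sentences)
    return "\n".join(rows) + "\n"
```

With the switch off, the output stays a txt file with one sentence per line, so only the automated run that passes the flag would see the TSV form.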

Automatic upload might be nice, but IMHO not worth implementing given the volume. But then that also depends on the process behind it, as it would probably always be one person uploading these (needs to be someone who is trusted to not introduce changes).

bozden · April 15, 2024, 7:47pm

Yes, I meant this.

would be fine to just adjust that log to be a tsv
needs to be someone who is trusted

As the process will be under the umbrella of the CV team (that includes you, of course), the source and related license are known, and the QC will be done during the process and reported on the repo, I think it will be easier. I want to assume the process will not include moves between CV and Mozilla Legal, no 50-sentence reviews, etc., and that it can be automatic (not code-wise, just process-wise).

Automatic upload might be nice

AFAIK, currently, all bulk sentence additions are manually put in a job queue to be processed after they have passed the process. So, these could be added to that queue more directly.