Because somebody bumped this, let me give some feedback on the current status:
- Multi-sentence entry form
- Inclusion of bulk sentences through PR
There is no multi-sentence entry form, but we now have bulk submissions (although it is a rather lengthy process).
On the other hand, there are some related issues posted on GitHub. Because sentences are not cleaned thoroughly enough during entry (they can contain invisible characters such as LF, CRLF, and TAB), the text-corpora recently released with v17.0 have some disruptions. But I know this is currently being worked on (just received the message, thanks team!).
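To illustrate the kind of cleaning meant here, below is a minimal sketch of stripping invisible characters from a sentence at entry time. This is not the actual Common Voice code: the function name `clean_sentence` and the choice to treat all ASCII control characters the same way are my own assumptions.

```python
import re

# Assumption: every ASCII control character (TAB, LF, CR, etc.) is unwanted
# inside a sentence and can safely be replaced by a single space.
_CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f]+")
_MULTI_SPACE = re.compile(r" {2,}")


def clean_sentence(text: str) -> str:
    """Hypothetical helper: replace control characters with a space,
    collapse repeated spaces, and trim both ends."""
    text = _CONTROL_CHARS.sub(" ", text)
    text = _MULTI_SPACE.sub(" ", text)
    return text.strip()


if __name__ == "__main__":
    # repr() makes the invisible characters visible before and after
    raw = "Hello\tworld\r\nagain \n"
    print(repr(raw), "->", repr(clean_sentence(raw)))
```

Something like this would have to run before the sentence is stored; otherwise a stray TAB or line break ends up splitting a row when the corpus is exported to a tab-separated file.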
- Inclusion of exports from Wiki sources
AFAIK, the cv-sentence-extractor has not been changed to handle these yet, but it shouldn’t be hard to code… I don’t know how the workflow will change regarding this, but I can give a hand if you like @mkohler (cc: @jesslynnrose & @gina).
- Exports of text-corpora from the database to files
- Basic statistics of how many sentences are waiting for validation
I was thinking of exporting them to GitHub under server/data/<lc> at that time, but the project moved away from the idea as it is not scalable. Instead, with v17.0 the text-corpora are part of the dataset distributions, which is perfect (thank you team!). On the other hand, we only get to see them every three months, not just-in-time.
This enables us to examine the text corpora statistically. As mentioned above, there are some issues and we cannot get the full information, but this is the first iteration, so we hope the issues will be fixed in the next version.
One of the issues also prevents getting statistics on “sentences waiting for validation”, as “invalidated” and “not yet validated” ones are in the same bucket, without any clue which is which.
To check if there are sentences waiting to be validated, you just need to check the review page, just like you do for recordings through the validate page.
You can check your text-corpus statistics in the Analyzer (text-corpus tab), as far as the current data allows.