Updated Process for Submitting Bulk Sentence Requests

gina · February 6, 2024, 6:17am

Hello Community

We have finally reached the end of reviewing the bulk sentence request process. Thank you for your patience while the team worked on it. The new process will consist of the following steps after submitting a request:

The community coordinator will review the request to make sure that all the fields are completed, and will then send the file to legal for review.
If legal approves, the community coordinator will communicate with the communities for quality assurance checks. This process involves the following:

Random selection of 50 sentences marked on the submitted file to be quality checked. Quality assurance comments should be provided in the “Sentence quality assurance feedback” column.
If the reviewed sentences meet the quality assurance criteria, the bulk sentences will be merged.
If the review reveals that some sentences are not of sufficiently high quality and do not meet the criteria (e.g. poor spelling or grammar, dubious open source provenance, violation of community guidelines), the reviewer should inform the community coordinator and the submitter, so that the identified issues can be resolved and the file resubmitted.

If legal does not approve, the community coordinator will communicate with the submitter and provide reasons for rejection. The submitter will work with the community coordinator for resubmission.
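
For reviewers, the random selection of 50 sentences described in the quality assurance steps above could be done along these lines. This is only a minimal sketch in TypeScript: the function name `sampleForReview` and its parameters are hypothetical, and it assumes the submitted sentences have already been read from the file into an array.

```typescript
// Hypothetical helper (not part of the Common Voice codebase):
// picks up to `count` sentences uniformly at random, without repeats.
function sampleForReview(
  sentences: readonly string[],
  count: number = 50,
  random: () => number = Math.random
): string[] {
  // Work on a copy so the submitted list is left untouched.
  const pool = [...sentences];
  const take = Math.min(count, pool.length);

  // Partial Fisher-Yates shuffle: only the first `take` positions are settled.
  for (let i = 0; i < take; i++) {
    const j = i + Math.floor(random() * (pool.length - i));
    const tmp = pool[i]!;
    pool[i] = pool[j]!;
    pool[j] = tmp;
  }
  return pool.slice(0, take);
}
```

If a submission contains fewer than 50 sentences, the sketch simply returns all of them in random order.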

Community members must only submit sentences that fall within the public domain. Alternatively, they should sign the CC0 waiver form to prevent any delays in the updated process.

Thank you for your cooperation. This is to ensure the continued high quality of the sentences.

Thanks

bozden · February 6, 2024, 1:32pm

Hey @gina, thank you for the update. Just some quick questions:

What is the medium of the initial and final request/submission? File upload via web form? GitHub? Is there a sample?
Not all submissions are equal wrt sources. They can come from books that have entered the public domain, or they can be self/community generated, etc. In the latter case, where they are already quality checked, most of these steps seem to be unnecessary.
In case of “questionable” legal status, wouldn’t it be logical to check with legal before preparing the resource file?

I’m asking because our community’s workflow aims for 0 (zero) errors: multiple people read ALL sentences before submitting. I had several rejections in the past after such hard work…

neouyghur · February 7, 2024, 1:34am

Hi @gina, thanks for the updates. Do we need to reupload the sentences we uploaded before this policy?

gina · February 7, 2024, 10:07am

Hi @neouyghur, yes please. Kindly use the updated template.

gina · February 7, 2024, 10:16am

Hey @bozden, thank you for the questions.

The process is still the same: files are uploaded via the MCV website using the updated template.
Yes, that is indicated in the “Source” column of the template.
Given the steps involved in quality assurance, we prefer that legal approves first.
We’ve had community members complain about bad sentences in the corpus. This process is to ensure that we accept only high quality sentences.

neouyghur · February 8, 2024, 5:41am

Hi @gina, I uploaded the sentences with the new template through the MCV website. What is the next step?

Reverend · February 10, 2024, 10:15am

Is the CC BY-SA 3.0 license valid for bulk submissions?

bozden · February 10, 2024, 3:27pm

No @Reverend, only CC-0 / Public Domain…

neouyghur · February 14, 2024, 7:23am

How do we know the progress of legal checking? I have submitted bulk sentences, but I have not got any updates.

gina · February 14, 2024, 1:29pm

Hi @neouyghur, the file has been sent to legal for copyright license review, a process that may require some time. I will update you on the subsequent steps as soon as I receive feedback.

gina · February 14, 2024, 1:33pm

Hi @neouyghur, this process may take some time. I will email you the next steps as soon as I receive feedback. Kindly note that after legal review, I need to find a quality assurance person to check the file, which may take more time. Kindly be patient as we go through this process. I will keep you updated.

gina · February 14, 2024, 1:37pm

Yes, only CC-0 or sentences in the public domain

neouyghur · February 15, 2024, 3:42am

Hi @gina thanks for your reply. I am one of the members of the Uyghur community who is actively contributing to the CV Uyghur dataset. I can assist you in finding the right person to evaluate the sentences written in Uyghur.

bozden · February 16, 2024, 3:48am

@gina, I’m adapting our collaborative Google Sheet template to the new style, but I have a question about the template:

In your examples, you used real names as the source. Is this mandatory?

We are working as a community and we switched to use nicknames after the TTS hit. I’m more or less a public figure, but I don’t want to expose any other people with their real names… “Community member, copyright waived” would suffice I’d like to presume…

gina · February 21, 2024, 11:56am

Hi @bozden, yes that would suffice.

bozden · February 22, 2024, 12:29am

I assume that at the final stage the import script will be triggered manually, as it is now.

But with so many free-form fields, especially the “Domain” column, you will have a hard time handling these.

This is what is inserted into the DB:

  // Excerpt: the enclosing function's opening line is not shown.
  // Seeds the sentence_domains table with the twelve predefined domain keys.
  await db.runSql(`
    INSERT INTO sentence_domains (domain)
    VALUES ('general'),
      ('agriculture'),
      ('automotive'),
      ('finance'),
      ('food_service_retail'),
      ('healthcare'),
      ('history_law_government'),
      ('media_entertainment'),
      ('nature_environment'),
      ('news_current_affairs'),
      ('technology_robotics'),
      ('language_fundamentals')
  `)
}

But you might get anything there, even in a localized language. I think a stricter rule for that would be much better.
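
To illustrate what a stricter rule could look like, here is a rough sketch that maps a free-form “Domain” cell onto one of the twelve keys from the INSERT statement above and rejects everything else. The names `SENTENCE_DOMAINS` and `normalizeDomain` are my own invention for this example, not something from the import script, and the normalization rules (lowercasing, turning spaces and punctuation into underscores) are assumptions.

```typescript
// The twelve keys copied from the INSERT statement above.
const SENTENCE_DOMAINS = [
  "general",
  "agriculture",
  "automotive",
  "finance",
  "food_service_retail",
  "healthcare",
  "history_law_government",
  "media_entertainment",
  "nature_environment",
  "news_current_affairs",
  "technology_robotics",
  "language_fundamentals",
] as const;

type SentenceDomain = (typeof SENTENCE_DOMAINS)[number];

// Hypothetical normalizer: returns the matching key, or null when the
// cell does not correspond to any known domain (e.g. a localized name).
function normalizeDomain(raw: string): SentenceDomain | null {
  const key = raw
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "_") // spaces, slashes, "&", etc. become "_"
    .replace(/^_+|_+$/g, ""); // drop leading/trailing underscores
  const match = SENTENCE_DOMAINS.find((domain) => domain === key);
  return match ?? null;
}
```

With something like this, “News / Current Affairs” would resolve to `news_current_affairs`, while a localized or misspelled value would come back as `null` and could be flagged to the submitter before import.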