Updated Process for Submitting Bulk Sentence Requests

Hello Community

We have finally reached the end of reviewing the bulk sentence request process. Thank you for your patience while the team worked on it. The new process will consist of the following steps after submitting a request:

  1. The community coordinator will review the request to make sure that all the fields are completed, then send the file to legal for review.
  2. If legal approves, the community coordinator will contact the community to arrange quality assurance checks. This process involves the following:
  • A random selection of 50 sentences, marked in the submitted file, will be quality checked. Quality assurance comments should be provided in the “Sentence quality assurance feedback” column.

  • If the reviewed sentences meet the quality assurance criteria, the bulk sentences will be merged.

  • If the review reveals that some sentences do not meet the criteria (e.g. poor spelling or grammar, dubious open source provenance, violation of community guidelines), the reviewer should inform the community coordinator and the submitter so that the identified issues can be resolved and the file resubmitted.

  3. If legal does not approve, the community coordinator will communicate with the submitter and provide reasons for rejection. The submitter will work with the community coordinator for resubmission.
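
The random selection in step 2 could be scripted rather than done by hand. The following is only a minimal sketch, assuming the submitted file has already been parsed into an array of sentence strings. `sampleForReview` and its optional `rng` parameter are hypothetical names used for illustration; they are not part of the actual MCV tooling.

```javascript
// Hypothetical helper (not part of the MCV tooling): pick up to `count`
// sentences uniformly at random, without repeats, using a partial
// Fisher-Yates shuffle. `rng` defaults to Math.random and can be swapped
// for a seeded generator when a reproducible selection is needed.
function sampleForReview(sentences, count = 50, rng = Math.random) {
  const pool = sentences.slice(); // copy so the caller's array is untouched
  const n = Math.min(count, pool.length);
  for (let i = 0; i < n; i++) {
    // Choose an index from the not-yet-selected tail [i, pool.length).
    const j = i + Math.floor(rng() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, n);
}

// Usage: 200 placeholder sentences, 50 selected for quality checking.
const submitted = Array.from({ length: 200 }, (_, k) => `sentence ${k + 1}`);
const selected = sampleForReview(submitted);
console.log(selected.length); // 50
```

If a file contains fewer than 50 sentences, this sketch simply returns all of them; how the real process treats small files is not stated above.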

Community members must only submit sentences that fall within the public domain. Alternatively, they should sign the CC0 waiver form to prevent any delays in the updated process.

Thank you for your cooperation. This process is meant to ensure the continued high quality of the sentences.

Thanks


Hey @Gina_Moape, thank you for the update. Just some quick questions:

  1. What is the medium of the initial and final request/submission? File upload via web form? Github? Is there a sample?
  2. Not all submissions are equal with respect to sources. They can be books that have entered the public domain, or self/community generated, etc. In the latter case, where they are already quality checked, most of these steps seem unnecessary.
  3. In case of “questionable” legal status, wouldn’t it be logical to check with legal before preparing the resource file?

I’m asking because our community aims for 0 (zero) errors in its workflow: multiple people read ALL sentences before submitting. I have had several rejections in the past after such hard work…


Hi @Gina_Moape, thanks for the updates. Do we need to reupload the sentences we uploaded before this policy?


Hi @neouyghur yes please, kindly use the updated template.


Hey @bozden, thank you for the questions.

  1. The process is still the same: files are uploaded via the MCV website using the updated template.
  2. Yes, that is indicated in the “Source” column of the template.
  3. Given the steps involved in quality assurance, we prefer that legal approve first.
    We’ve had community members complain about bad sentences in the corpus, so this is to ensure that we accept only high quality sentences.

Hi @Gina_Moape, I uploaded the sentences with the new template through the MCV website. What is the next step?

Is the CC BY-SA 3.0 license valid for bulk submissions?

No @Reverend, only CC-0 / Public Domain…


How do we know the progress of legal checking? I have submitted bulk sentences, but I have not got any updates.

Hi @neouyghur, the file has been sent to legal for copyright license review, a process that may require some time. I will update you on the subsequent steps as soon as I receive feedback.

Hi @neouyghur, this process may take some time. I will email you the next steps as soon as I receive feedback. Kindly note that after the legal review I need to find a quality assurance person to check the file, which may take more time. Kindly be patient as we go through this process. I will keep you updated.

Yes, only CC-0 or sentences in the public domain

Hi @Gina_Moape thanks for your reply. I am one of the members of the Uyghur community who is actively contributing to the CV Uyghur dataset. I can assist you in finding the right person to evaluate the sentences written in Uyghur.

@Gina_Moape, I’m adapting our collaborative Google Sheet template to the new style, but I have a question about the template:

In your examples, you used real names as the source. Is this mandatory?

We are working as a community and we switched to use nicknames after the TTS hit. I’m more or less a public figure, but I don’t want to expose any other people with their real names… “Community member, copyright waived” would suffice I’d like to presume…

Hi @bozden, yes that would suffice.


I assume that at the final stage the import script will be triggered manually, like it is now.

But with so many free-form fields, especially the “Domain” column, you will have a hard time handling these.

This is what is inserted into the DB:

  // Excerpt from a database migration; the enclosing function is not shown.
  // Seeds the fixed set of domain keys that a sentence can be tagged with.
  await db.runSql(`
    INSERT INTO sentence_domains (domain)
    VALUES ('general'),
      ('agriculture'),
      ('automotive'),
      ('finance'),
      ('food_service_retail'),
      ('healthcare'),
      ('history_law_government'),
      ('media_entertainment'),
      ('nature_environment'),
      ('news_current_affairs'),
      ('technology_robotics'),
      ('language_fundamentals')
  `)

But you might get anything there, even in a localized language. I think a stricter rule for that would be much better.
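
One possible form of such a rule is to check each row's “Domain” value against the list above before the import runs. This is a minimal sketch under stated assumptions: each template row is assumed to be parsed into an object with a `domain` field, and `ALLOWED_DOMAINS` copies the values from the snippet above. The names `normalizeDomain` and `findInvalidDomains`, the row shape, and the normalization choices are hypothetical and do not describe the actual import script.

```javascript
// Domain keys copied from the INSERT statement quoted above.
const ALLOWED_DOMAINS = new Set([
  'general', 'agriculture', 'automotive', 'finance', 'food_service_retail',
  'healthcare', 'history_law_government', 'media_entertainment',
  'nature_environment', 'news_current_affairs', 'technology_robotics',
  'language_fundamentals',
]);

// Hypothetical normalization: trim, lowercase, and turn runs of
// whitespace or hyphens into a single underscore.
function normalizeDomain(value) {
  const text = value == null ? '' : String(value);
  return text.trim().toLowerCase().replace(/[\s-]+/g, '_');
}

// Returns the rows whose domain is not in the allowed set, with 1-based
// row numbers. An empty value is reported too; whether the template
// allows a blank domain is not stated in this thread.
function findInvalidDomains(rows) {
  const problems = [];
  rows.forEach((row, index) => {
    if (!ALLOWED_DOMAINS.has(normalizeDomain(row.domain))) {
      problems.push({ row: index + 1, value: row.domain });
    }
  });
  return problems;
}

// Usage with made-up rows (assumed shape: { sentence, domain }).
const rows = [
  { sentence: 'Example one.', domain: 'Healthcare' },
  { sentence: 'Example two.', domain: 'sağlık' }, // localized value, rejected
  { sentence: 'Example three.', domain: 'news current affairs' },
];
console.log(findInvalidDomains(rows)); // [ { row: 2, value: 'sağlık' } ]
```

Running a check like this at submission time would let the submitter fix the column before the file reaches legal or quality assurance, instead of the problem surfacing at import.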