Sentence collection tool - February 2019 milestone discussion

Context

We are about to release the functional beta, whose goal was to provide a working tool to collect, review and export valid sentences for the Common Voice project.

Looking ahead, there are a few themes that we consider important:

  • [Internationalization] Non-English speakers should be able to use the tool. Common Voice really needs to grow in some non-English locales during the first half of 2019.
  • [User experience] The review phase should not be a blocker or a pain for people looking to help.
  • [User experience] Contributors should find the tool agile and easy to use, even for high volumes of sentences.
  • [Quality] The quality of the validated sentences should be high, so they don’t need additional clean-up afterwards.

Priorities for the next milestone (proposal)

Given this context and all the current open issues, here is a proposal for discussion on what our priorities should be for the next milestone. Everyone is invited to comment, even if you are just interested in the project and haven’t contributed yet.

Top priority (should happen next milestone)

  • Be able to localize the site.
  • Provide context on why sentences are not valid and ensure people can mass upload them.

Nice to have (only if we have time for next milestone)

  • Be able to edit wrong sentences during the review process
  • Add more validations (per locale, offensive language and other cleaning algorithms)

(Note that we can always move things to later milestones. Not being top priority doesn’t mean it won’t happen.)

We’ll keep this topic open for comments until January 23rd, and then @mkohler, as community technical lead, and I will create a final roadmap based on all your comments.

Questions for you

  • What do you think about these priorities?
  • What would you change? (if anything)
  • Are we missing something? (not captured here or in the issues)

Thanks everyone!


About the priorities

I think they are nice!
Especially making it easier to upload and review wrong sentences.
But I agree with localization being top priority, since it will need further work afterwards.

Other things

Actually, what bothers me most while testing the tool are the following things.
But they may not be top priority.

1) About collection and upload

A) For me, the hardest part of the collection workflow is the collection itself:

  • Find the CC0 content
  • Extract and format the content to upload

But this may need another tool (or tools). (sorry)

B) For mass upload of sentences, one nice thing would be file upload (a .txt would be nice).

2) About review

The process is great to me, actually fun! :slight_smile:
What may be improved:

A) Really skip sentences

If I skip a sentence, it should not reappear on the next review page (I think @mkohler mentioned this problem here: https://github.com/Common-Voice/sentence-collector/issues/55)

Keep up the good work, guys! :clap:


This is interesting. We should understand this better and maybe create a separate topic with the people who have been really effective at this, so the community knowledge gets shared.

The priorities look good / realistic to me!

Hi,
Thanks for your work so far!
I think that the tool should be able to automatically split a text into sentences based on punctuation rules (commas and full stops at least). An explanation of why a sentence is invalid would also be nice.


Do you mean that you provide a full paragraph, or a set of them, and the tool cuts them into appropriate sentences? Can you describe the need you have in mind a bit more? How important is this for you compared with other features?

Thanks for your feedback!

Hi!
I used to play around with KDE Simon as a speech recognition interface, which did exactly that. I could just paste in a long text, and it would start a new sentence after each full stop. It’s not high priority (it can still be done in a text editor by replacing "." with "\n" or so), but it can make entering, for example, a whole chapter from a public domain book much quicker.
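
For anyone who wants to try the "new sentence after each full stop" idea before the tool supports it, here is a minimal Python sketch. It is not how KDE Simon or the sentence collector does it: the function name split_into_sentences is made up for illustration, and treating "?" and "!" as sentence endings is an extra assumption. It will also split wrongly after abbreviations such as "Dr.", so the result still needs a human look.

```python
import re


def split_into_sentences(text: str) -> list[str]:
    """Split text after each full stop, question mark, or exclamation mark."""
    # Collapse newlines and repeated whitespace so hard-wrapped lines join up.
    normalized = " ".join(text.split())
    # Split at the whitespace that follows sentence-ending punctuation;
    # the lookbehind keeps the punctuation attached to its sentence.
    parts = re.split(r"(?<=[.!?])\s+", normalized)
    # Drop the empty string produced by an empty input.
    return [part for part in parts if part]


if __name__ == "__main__":
    chapter = "It was a dark night. Nobody was home.\nWhere had they gone?"
    # Print one sentence per line, ready to paste into a submission form.
    print("\n".join(split_into_sentences(chapter)))
```

The one-sentence-per-line output is the same thing the manual "." to "\n" replacement produces, just without losing the full stops.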

On Thu, 17 Jan 2019 at 12:43 PM, Rubén Martín via Mozilla Discourse discourse@mozilla-community.org wrote:

Thanks for this post @nukeador, super helpful to see. :slight_smile:

For priorities (I mentioned this during sprint review today but wanted to follow up here):

  1. We should determine at what point Mozilla IAM login is integrated with the sentence collection tool. Perhaps not next milestone, but let’s start to track it within priorities.
  2. With IAM login integration, can we merge profiles from Common Voice with already existing sentence collection profiles so people start to see some continuity with their contributions across both instances? @mkohler @gregor this is something I hope we can keep in mind as the tool is built.
  3. Determine success criteria for the tool. At what point do we determine this beta is a success and take steps toward further UX/UI refinement and alignment with the wider Common Voice platform? For example, the v2 prototype** that the design team has started work on incorporates much of the feedback already listed in this post (ability to skip; ability to edit; sentence detection and count; file upload; sentence criteria clarity).

**Design is considering the beta as v1, and we plan to solicit feedback here when the v2 prototype is further along (Feb).

Agreed that the current priorities you’ve outlined take precedence. I’m looking to understand how the milestones relate to the beta phase and at what point we evaluate moving forward from beta.


Hi,

@nukeador It would be amazing if we could log in using GitHub or similar, to avoid having to create an account.
Also, we should probably serve the sentences for review in random order and not let people see them all, to prevent people from submitting sentences and then manually reviewing those specific ones to mess with the data.
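
The random-serving idea could look roughly like this in Python. This is a sketch only: the dictionary fields (text, submitted_by), the function name pick_review_batch, and the rule of hiding a reviewer's own submissions are all assumptions for illustration, not how the tool actually works.

```python
import random


def pick_review_batch(sentences, reviewer, batch_size=5, rng=random):
    """Return up to batch_size random sentences the reviewer did not submit."""
    # Assumption: each sentence is a dict with "text" and "submitted_by" keys.
    candidates = [s for s in sentences if s["submitted_by"] != reviewer]
    # Sample without replacement so a batch never repeats a sentence.
    return rng.sample(candidates, min(batch_size, len(candidates)))


pending = [
    {"text": "The cat sat on the mat.", "submitted_by": "alice"},
    {"text": "Rain is expected tomorrow.", "submitted_by": "bob"},
    {"text": "She opened the window.", "submitted_by": "carol"},
    {"text": "The train left on time.", "submitted_by": "bob"},
]

# alice only ever gets a random subset of other people's sentences.
batch = pick_review_batch(pending, reviewer="alice", batch_size=2)
```

Because the reviewer never chooses which sentences come up, approving a specific submission on purpose becomes much harder.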

I wonder, however, why the sentence collection tool is separate from the Common Voice website rather than integrated with it?

Thanks

My understanding was that the goal right now was to get the tool out ASAP so languages with fewer than 5000 sentences can get up on the site quicker.

Integrating with the main CV site is a later goal.

We know there are a lot of people who already have a big corpus ready for submission that comes from sources other than their own writing, so we want to make it easy for them to review it. Note that one review is not enough to validate a sentence, so we should be fine here.

As @dabinat mentioned, we needed to get something functional out there quickly to solve the submission and review process. We want to have something integrated into the main CV site in the future.

Thanks.

Hello everyone,

Yesterday @mkohler and I met to review the roadmap and your comments.

Based on that, we made the following decisions for the February milestone:

Top priority (should happen)

  • Be able to localize the site: It’s clear we really need to provide support for non-English speakers if we want to grow strong in some locales this year.
  • Provide context on why sentences are not valid and ensure people can mass upload them: Our goal is that collecting large sets of sentences is a nice experience.
  • Any new major bugs in the tool that break the workflow.

Nice to have (only if we have time after finishing top priorities)

  • Be able to edit wrong sentences during the review process (we estimated this could take some time to code)
  • Add more validations (per locale, offensive language and other cleaning algorithms)

We will be creating the milestone on GitHub and prioritizing work based on it. We will also encourage new people who want to get involved in the coding to focus on this milestone.

As usual, we will keep issues open for bug reporting. Feature requests will be evaluated for the next milestones.

Thanks everyone for your input! :slight_smile:


A post was split to a new topic: Sentence validation process