Sentence Collector Open Discussions - Input needed

Hi everyone

I’m cleaning up the issues in the Sentence Collector project and am moving open discussions here. We can then file more concrete issues once we have come to a conclusion on those.

Happy to hear your thoughts!

Randomize Sentence Order in Review section

Originally reported at:

There seem to be instances where reviewing can get boring, as there are alphabetical lists of similar sentences right after each other. This is probably due to some list of sentences being uploaded. We are currently surfacing the most reviewed sentences first, in the same order as they were created.

Is this a general issue you’ve encountered? Do you think randomizing while keeping the most reviewed sentences at the front would improve the process?
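
To make the question concrete, here is a rough sketch of what "randomizing while keeping the most reviewed sentences at the front" could mean: sort by number of existing reviews, then shuffle only among sentences with the same count. The `review_count` and `text` field names are assumptions for illustration, not the Sentence Collector's actual data model.

```python
import random
from itertools import groupby


def order_for_review(sentences, rng=None):
    """Return sentences most-reviewed first, shuffled within equal review counts.

    `sentences` is a list of dicts with a hypothetical `review_count` key.
    The input list is not modified.
    """
    if rng is None:
        rng = random.Random()
    # Sort by review count, highest first.
    by_reviews = sorted(sentences, key=lambda s: s["review_count"], reverse=True)
    ordered = []
    # groupby works here because equal counts are adjacent after sorting.
    for _, group in groupby(by_reviews, key=lambda s: s["review_count"]):
        bucket = list(group)
        rng.shuffle(bucket)  # randomize only inside one review-count bucket
        ordered.extend(bucket)
    return ordered
```

With this approach a sentence that already has one review would still be shown before sentences with none, but an alphabetical upload would no longer appear as one long block within its bucket.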

Process rejected sentences

Originally reported here:

Currently rejected sentences are not actionable, as they are simply a list in their own section. The suggestion in this issue was the following:

  • Allow changes to the sentences and resubmit them from within the list instead of having to copy them over manually
  • Allow sentences to be marked as correct without change and resubmit them

My thoughts: I don’t like resubmitting without change, as there most probably is something wrong with a sentence if it gets rejected by two reviews. However, I like the idea of being able to fix and resubmit. I would take a slightly different approach though: let the user select which sentences to resubmit, and once the user clicks a “Resubmit” button we automatically switch to the “Add” page, prefilled with the selected sentences.

What should happen with the overview after resubmitting? I’d say the sentences should stay there. Together we could go down the road of marking these as “taken care of” and only showing open sentences not marked as taken care of at the top.

Would that help others as well? What do you think of my suggested approach?

Additionally, how should we handle notifying users about newly rejected sentences as suggested in Sentence Collector Open Discussions - Input needed? How would you want to mark them as resolved, keeping the above points in mind too? There seems to be some overlap here, which I think we can tackle together.

Add a user filter to the Review section

Originally reported here:

This issue suggests adding a “created by” user filter on the Review page, so that only sentences by specific users are displayed and reviewed.

Would that help others too? What would we need to consider to make sure sentences are not rejected based on who added them?

Warning about possible errors

Originally reported at:

This issue includes quite a few different checks, so please read the original issue. Which checks would be the most helpful to add to the rules files? How could these be done efficiently?

Import from RSS

Originally reported here:

This is an old issue suggesting adding the ability to import from an RSS feed. This could for example be used to import all articles from a personal blog.

Should this be implemented in the Sentence Collector? Or would it better fit into the Sentence Extractor approach, using the Quality Assurance process we already have there? What do you think?


Notifications about new sentences

Originally reported here:

This issue is requesting email notifications when new sentences are added. Before going into the mechanism of how this could potentially work, I’d like to ask the following questions.

Would notifications help you in your workflow of reviewing new sentences? What kind of notifications would you like to have? What cadence should that be?

Reviewing by submission

Originally reported here:

Is there still a need to be able to review by submission? What should that process look like? How could we make sure that submissions are not falsely identified as “generally bad” after reading just one of their sentences?

Looking forward to your thoughts!


Currently we can easily notice if someone has uploaded inappropriate material such as a novel or copyrighted news, because the order remains the same as in the source content. If we randomize the sentences, then it will be extremely difficult to find those problems, so I don’t agree with it. Also, I don’t think reviewing sentences should be “fun”.

Yes, this would help. In most cases, if one person submits copyrighted sentences such as a whole novel, rejecting them all at once will save us a lot of time. We can simply provide a feedback route for people whose sentences got rejected for an inappropriate reason; I don’t think that would be a blocker for this. Saving core contributors’ time should be our priority, as they probably review 80% of sentences.

I don’t think we should keep this, as it often leaves new people confused about whether they have submitted successfully or not.

Agreed about not randomizing by default due to licensing fun;
RSS - probably sentence extractor, but will need to be manual per each RSS feed with QA on each;
Rejected processing - once a sentence has been rejected, it should stay rejected. One should be able to resubmit it as a new sentence; it could be possible to insert an input field straight into the rejected sentence page on request, allowing each sentence to be edited and submitted directly and individually. No need to redirect to the normal add sentence page - the source can be autogenerated from the original, and the “my own work” button is kinda implicitly checked for this case.

I believe you are talking about the additional steps when submitting new sentences here? Agreed that those can be confusing, but the mentioned issue is about something different - but isn’t it basically identical to “filter by user”, just with greater granularity?

Hi, I submitted this issue. I didn’t realise that reviewing sentences includes verifying copyright; this is not very clear in the instructions… I agree that for this point retaining the original order is important. Also for other datasets, e.g. when the sentences form a story or some such, it is just nicer to read through.

I find it worrisome if sentence collecting shouldn’t be fun. I think, for all intents, if working through the sentences is a rather tedious task even at this point, it will be even worse for users in the recording and reviewing phase. I don’t know about others, but I tend to go for the kind of language data curation tasks that I enjoy; I work on these for free in my spare time after all, and there are easily a ton of fun NLP-related data curation tasks for smaller languages at all times.

There are two approaches:

If the import is just run once, I think it’s fine, but it doesn’t seem much different from manual copy-paste.

If the import runs continuously, then I would worry about the licensing problems it brings, such as someone pointing an import at a copyrighted RSS feed (then we would need to keep manually rejecting a lot of sentences) or the source changing its license after some time (it’s not easy to keep checking licenses). Also, manual line-breaking is still needed.

Considering the risk, perhaps what we should provide is not an RSS import but instructions on “how to export your blog manually and turn it into an editable source txt”.

Just dropping a notice every week saying “new sentences are available to review” would help.

I would like to suggest a feature that lets locales add a short message on “Add sentences” and “Review sentences”, shown to contributors, so that we are able to point them to the local community, chatroom, this Discourse or a local meetup.

Take our HK contributor @hyxibg5lez as an example: they had tried to contribute to Common Voice for months before they eventually found this Discourse, connected with other local contributors and were able to resolve some of their questions and blockers. They are now a core contributor who has been leading the progress on Cantonese over the last few months.

I believe connecting contributors with each other can raise engagement, ease onboarding, reduce the number of problematic sentences submitted, and keep people contributing for much longer.

Thanks for the answers here so far, keep it coming!

I think it’s really hard to verify the copyright status of given sentences, no matter if they are in the same order as submitted or not. Eventually, after reviews, there might only be a few left in the queue, and it gets harder and harder to identify them.

I think copyrighted sentences shouldn’t be downvoted one by one. They should be reported (for example here), so that we can delete all of them at once.

In any case, reviewing by submission and not for the full user is IMHO better for that case if we want to go down that road. I’d say it’s more likely that a submission is from the same source, and not necessarily all sentences from a given user. But again, I think copyrighted sentences should be able to be reported and we take care of it to make sure we can delete all of them and do not miss any.
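
As a rough illustration of "by submission" rather than "by user", here is a minimal sketch that groups sentences by the upload they came from while keeping their original order inside each group. The `batch_id` field is a hypothetical name for whatever identifies one submission; I’m not claiming the current data model has it.

```python
from collections import defaultdict


def group_by_submission(sentences):
    """Group sentences by a hypothetical `batch_id`, preserving upload order."""
    groups = defaultdict(list)
    for sentence in sentences:
        groups[sentence["batch_id"]].append(sentence)
    # Plain dict keeps the order in which each submission first appeared.
    return dict(groups)
```

A report for copyrighted material could then point at one such group, so that all of its sentences can be deleted together without touching the same user's other submissions.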

@irvin can you elaborate on that? Reviewing by submission would be new, so I’m not sure what you are referring to as “keeping it”. What exactly is confusing contributors?

We also have some posts on Discourse with review guidelines which could be linked there. I think that would be a good thing indeed, but that will require l10n infrastructure to be set up. Edit: might be tricky though, as this would need a mechanism to create that message, but it’s bound to the language you’re submitting for, not necessarily for the language of the GUI.

We can make it as simple as a JSON file in the repo containing a paragraph of messages for all locales, and contributors can file a PR to add or modify the contents for each locale. We can simply ask people on Discourse or Matrix to review the PR if it’s made by people we don’t know yet.
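
A minimal sketch of how such a file could be looked up, just to show the idea. The file layout, the `add` / `review` page keys and the placeholder texts are all assumptions, not an existing format in the repo. The lookup uses the language the sentences are submitted for, not the GUI language.

```python
import json

# Hypothetical file contents: sentence locale -> page name -> message.
LOCALE_MESSAGES_JSON = """
{
  "zh-HK": {
    "add": "Placeholder: pointer to the local community chat.",
    "review": "Placeholder: pointer to the local review guidelines."
  },
  "en": {
    "add": "Placeholder: pointer to this Discourse."
  }
}
"""


def load_messages(raw_json):
    """Parse the per-locale message file."""
    return json.loads(raw_json)


def get_locale_message(messages, sentence_locale, page):
    """Return the message for a locale and page, or None if there is none."""
    return messages.get(sentence_locale, {}).get(page)


messages = load_messages(LOCALE_MESSAGES_JSON)
```

Locales without an entry simply show no message, so nothing has to be translated up front.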

I thought you were referring to this view; it’s easy to get confused about whether the sentences have been submitted or not. The first time I got here, I clicked Review and thought that they had all been submitted (but not?).

If the sentences are in sequence, you can easily notice if they come from well-known stories or materials, or you can see which sentences are connected, which makes it easier to google them, for example:


1. Where did you find that apple?
2. Teresa, Mildrid, Ralph, and Vonda all arrived yesterday evening.
3. Sometimes I wonder how there can be so many stars in the sky.
4. Does the venue offer free WiFi?
5. President Herbert Hoover had two children.
6. Turning the envelope over
7. his hand trembling
8. Harry saw a purple wax seal 
9. bearing a coat of arms
10. a lion, an eagle, a badger
11. and a snake surrounding a large letter 'H'
12. Harry Potter has never even heard of Hogwarts 
13. when the letters start dropping on the doormat
14. Addressed in green ink on yellowish parchment
15. with a purple seal
16. He gained fame as an Italian racing cyclist.
17. His saxophone solo was incredible.
18. I woke up to the screech of my neighbor practicing trumpet.
19. The juggler managed to keep five balls in the air at once.
20. Please shut the door.

You can easily notice that the run of sentences on lines 6-15 must come from some novel, but if we shuffle them as below, it’s much, much harder to notice.

1. The juggler managed to keep five balls in the air at once.
2. a lion, an eagle, a badger
3. Please shut the door.
4. Where did you find that apple?
5. Does the venue offer free WiFi?
6. I woke up to the screech of my neighbor practicing trumpet.
7. Addressed in green ink on yellowish parchment
8. when the letters start dropping on the doormat
9. He gained fame as an Italian racing cyclist.
10. Sometimes I wonder how there can be so many stars in the sky.
11. and a snake surrounding a large letter 'H'
12. bearing a coat of arms
13. his hand trembling
14. Turning the envelope over
15. His saxophone solo was incredible.
16. President Herbert Hoover had two children.
17. Harry saw a purple wax seal
18. Teresa, Mildrid, Ralph, and Vonda all arrived yesterday evening.
19. with a purple seal
20. Harry Potter has never even heard of Hogwarts

Thanks for the example regarding the review confusion. The points here are meant for the review page, not for the submission review. I can totally see how the submission review steps can be confusing. I’ll create a new thread for that when I get to it, as I have a suggestion for it.

Created a new discussion here: