Sentence Collector Open Discussions - Input needed

Hi everyone

I’m cleaning up the issues in the Sentence Collector project and moving open discussions here. We can then file more concrete issues once we have come to a conclusion on each of them.

Happy to hear your thoughts!

Randomize Sentence Order in Review section

Originally reported at:

There seem to be instances where reviewing can get boring, as there are alphabetical lists of similar sentences right after each other. This is probably due to some list of sentences being uploaded. We are currently surfacing the most reviewed sentences first, in the same order as they were created.

Is this a general issue you’ve encountered? Do you think randomizing while keeping the most reviewed sentences at the front would improve the process?
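To make the question concrete, here is a minimal sketch of what “randomizing while keeping the most reviewed sentences at the front” could mean. The `Sentence` record and its `review_count` field are assumptions made up for illustration, not the actual Sentence Collector schema or implementation.

```python
import random
from dataclasses import dataclass


@dataclass
class Sentence:
    # Hypothetical record; field names are assumptions for this sketch.
    text: str
    review_count: int


def order_for_review(sentences, seed=None):
    """Most reviewed first; sentences with equal review counts are shuffled."""
    rng = random.Random(seed)
    shuffled = list(sentences)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    # sorted() is stable, so the shuffled order survives within each review count.
    return sorted(shuffled, key=lambda s: s.review_count, reverse=True)
```

With this approach, an uploaded alphabetical list would be broken up within each review-count group, while partially reviewed sentences would still surface first.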

Process rejected sentences

Originally reported here:

Currently, rejected sentences are not actionable, as they are simply a list in their own section. The suggestions in this issue were the following:

  • Allow changes to the sentences and resubmit them from within the list instead of having to copy them over manually
  • Allow sentences to be marked as correct without change and resubmit them

My thoughts: I don’t like resubmitting without change, as there is most probably something wrong with a sentence if it gets rejected by two reviews. However, I like the idea of being able to fix and resubmit. I would take a slightly different approach, though: let the user select which sentences to resubmit, and once the user clicks a “Resubmit” button, we automatically switch to the “Add” page with the selected sentences prefilled.

What should happen with the overview after resubmitting? I’d say the sentences should stay there, though we could go down the road of marking these as “taken care of” and only showing open sentences not yet marked as taken care of at the top.

Would that help others as well? What do you think of my suggested approach?

Additionally, how should we handle notifying users about newly rejected sentences as suggested in Sentence Collector Open Discussions - Input needed? How would you want to mark them as resolved, keeping the above points in mind too? There seems to be some overlap here which we can tackle together I think.

Add a user filter to the Review section

Originally reported here:

This issue suggests adding a “created by” user filter on the Review page, so that only sentences by specific users are displayed and reviewed.

Would that help others too? What would we need to consider to make sure sentences are not rejected based on who added them?

Warning about possible errors

Originally reported at:

This issue includes quite a few different checks; please read the original issue. Which checks would be the most helpful to add to the rules files? How could these be done in an efficient manner?

Import from RSS

Originally reported here:

This is an old issue suggesting that we add the ability to import from an RSS feed. This could, for example, be used to import all articles from a personal blog.

Should this be implemented in the Sentence Collector? Would it fit better into the Sentence Extractor approach, using the Quality Assurance process we already have there? What do you think?


Originally reported here:

This issue is requesting email notifications when new sentences are added. Before going into the mechanism of how this could potentially work, I’d like to ask the following questions.

Would notifications help you in your workflow of reviewing new sentences? What kind of notifications would you like to have? What cadence should that be?

Reviewing by submission

Originally reported here:

Is there still a need to be able to review by submission? What should that process look like? How could we make sure that submissions are not falsely identified as “generally bad” after reading just one of their sentences?

Looking forward to your thoughts!


Currently we can easily notice if someone uploaded inappropriate material, such as a novel or copyrighted news, because the sentences remain in the same order as in the source content. If we randomize the sentences, it will be extremely difficult to find those problems, so I don’t agree with it. Also, I don’t think reviewing sentences should be “fun”.

Yes, this helps. In most cases, if one person submits copyrighted sentences such as a whole novel, rejecting them all at once will save us a lot of time. We can simply provide a route for feedback if people get rejected for an inappropriate reason; I don’t think that would be a blocker for this. Saving core contributors’ time should be our priority, as they probably review 80% of sentences.

I don’t think we should keep this, as it often leaves new people confused about whether they have submitted successfully or not.

Agreed about not randomizing by default due to licensing fun;
RSS - probably sentence extractor, but will need to be manual per each RSS feed with QA on each;
Rejected processing - once a sentence has been rejected, it should stay rejected. One should be able to resubmit it as a new sentence; it could be possible to insert an input field straight into the rejected sentences page on request, allowing each sentence to be edited and submitted directly and individually. No need to redirect to the normal add sentence page - the source can be autogenerated from the original, and the “my own work” button is kind of implicitly checked for this case.

I believe you are talking about the additional steps when submitting new sentences here? Agreed that those can be confusing, but the mentioned issue is about something different - but isn’t it basically identical to “filter by user”, just with greater granularity?

Hi, I submitted this issue. I didn’t realise reviewing sentences includes verifying copyright; this is not very clear in the instructions… I agree that for this point retaining the original order is important. Also, for other datasets, e.g. when the sentences form a story or some such, it is just nicer to read through them in order.

I find it worrisome if sentence collecting shouldn’t be fun. If working through the sentences is a rather tedious task even at this point, it will be even worse for users in the recording and reviewing phase. I don’t know about others, but I tend to go for the kind of language data curation tasks that I enjoy; I work on these for free in my spare time, after all, and there are easily a ton of fun NLP-related data curation tasks for smaller languages at all times.


There are two approaches -

If the import is just run one time, I think it’s fine, but it doesn’t seem much different from manual copy-paste.

If the import runs continuously, then I would worry about the license problems it brings, such as someone setting up an import from a copyrighted RSS feed (then we would need to keep manually rejecting a lot of sentences) or the source changing its license after some time (it’s not easy to keep checking licenses). Manual line-breaking would also still be needed.

Considering the risk, perhaps what we should provide is not an RSS import but instructions on “how to export your blog manually and turn it into an editable source txt”.

Just dropping a notice every week that “new sentences are available to review” would help.

I would like to suggest a feature that lets each locale add a short message to “Add sentences” and “Review sentences”. It would be shown to contributors and would let us point them to the local community, chatroom, this Discourse or a local meetup.

Take our HK contributor @hyxibg5lez as an example: they had tried to contribute to Common Voice for months before they eventually found this Discourse, connected with other local contributors and were able to resolve some of their questions and blockers. They are now the core contributor leading the progress on Cantonese over the last few months.

I believe connecting contributors with each other can raise engagement, ease onboarding, reduce the number of problematic sentences submitted, and keep people contributing for much longer.

Thanks for the answers here so far, keep it coming!

I think it’s really hard to verify the copyright status of given sentences, no matter if they are in the same order as submitted or not. Eventually, after reviews, there might only be a few left in the queue, and they get harder and harder to identify.

I think copyrighted sentences shouldn’t be downvoted one by one, they should be reported (for example here), and we can delete all of them at once.

In any case, reviewing by submission and not for the full user is IMHO better for that case if we want to go down that road. I’d say it’s more likely that a submission is from the same source, and not necessarily all sentences from a given user. But again, I think copyrighted sentences should be able to be reported and we take care of it to make sure we can delete all of them and do not miss any.

@irvin can you elaborate on that? Reviewing by submission would be new, so I’m not sure what you are referring to as “keeping it”. What exactly is confusing contributors?

We also have some posts on Discourse with review guidelines which could be linked there. I think that would be a good thing indeed, but that will require l10n infrastructure to be set up. Edit: might be tricky though, as this would need a mechanism to create that message, but it’s bound to the language you’re submitting for, not necessarily for the language of the GUI.

We can make it as simple as a JSON file in the repo containing a paragraph of messages for each locale, and contributors can file a PR to add or modify the contents for their locale. We can simply ask people on Discourse or Matrix to review the PR if it’s made by people we don’t know yet.
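As a rough sketch of that idea, the file could map a locale code to a short message, looked up by the language the sentences are submitted for rather than the GUI language. The keys, the example messages and the `message_for` helper below are all made up for illustration; nothing like this exists in the repo yet.

```python
import json

# Hypothetical file contents; locale keys and messages are placeholders.
LOCALE_MESSAGES_JSON = """
{
  "zh-HK": "Questions? Join the local community chatroom.",
  "ja": "Please read the review guidelines on Discourse before reviewing."
}
"""


def message_for(locale, raw=LOCALE_MESSAGES_JSON):
    """Return the community message for a sentence language, or None if unset."""
    messages = json.loads(raw)
    return messages.get(locale)
```

Locales without an entry simply show no message, so new languages would not need any setup until someone files a PR for them.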

I thought you were referring to this view; it’s easy to get confused about whether the sentences have been submitted or not. The first time I got here, I clicked Review and thought that they had all been submitted (but they had not?).

If the sentences are in sequence, you can easily notice if they come from well-known stories or materials, and you can see which sentences are connected, which makes it easier to google them. For example:


1. Where did you find that apple?
2. Teresa, Mildrid, Ralph, and Vonda all arrived yesterday evening.
3. Sometimes I wonder how there can be so many stars in the sky.
4. Does the venue offer free WiFi?
5. President Herbert Hoover had two children.
6. Turning the envelope over
7. his hand trembling
8. Harry saw a purple wax seal 
9. bearing a coat of arms
10. a lion, an eagle, a badger
11. and a snake surrounding a large letter 'H'
12. Harry Potter has never even heard of Hogwarts 
13. when the letters start dropping on the doormat
14. Addressed in green ink on yellowish parchment
15. with a purple seal
16. He gained fame as an Italian racing cyclist.
17. His saxophone solo was incredible.
18. I woke up to the screech of my neighbor practicing trumpet.
19. The juggler managed to keep five balls in the air at once.
20. Please shut the door.

You can easily notice that the run of sentences in items 6-15 must come from some novel, but if we shuffle them like below, it’s much, much harder to notice.

1. The juggler managed to keep five balls in the air at once.
2. a lion, an eagle, a badger
3. Please shut the door.
4. Where did you find that apple?
5. Does the venue offer free WiFi?
6. I woke up to the screech of my neighbor practicing trumpet.
7. Addressed in green ink on yellowish parchment
8. when the letters start dropping on the doormat
9. He gained fame as an Italian racing cyclist.
10. Sometimes I wonder how there can be so many stars in the sky.
11. and a snake surrounding a large letter 'H'
12. bearing a coat of arms
13. his hand trembling
14. Turning the envelope over
15. His saxophone solo was incredible.
16. President Herbert Hoover had two children.
17. Harry saw a purple wax seal
18. Teresa, Mildrid, Ralph, and Vonda all arrived yesterday evening.
19. with a purple seal
20. Harry Potter has never even heard of Hogwarts

Thanks for the example regarding the review confusion. The points here are meant for the review page, not for the submission review. I can totally see how the submission review steps can be confusing. I’ll create a new thread for that when I get to it, I have a suggestion for that.


Created a new discussion here:

Randomizing the order would help me a lot.

I tend to skip sentences that I’m not sure about.
So as time goes by, the first few pages tend to be the same sentences that I have already skipped. And, yes, it’s boring.

The current interface, which doesn’t let you skip ahead by a large number of pages, also makes it harder. If I could get past the first pages easily, or jump to page 536 or page 1380 easily, maybe a randomized order would not be necessary. But with the current UI, a randomized order of sentences would help a lot.

Japanese version (日本語版): Sentence Collectorへの意見 (Opinions on the Sentence Collector)

Anything? I have a lot to say.

  • As for How to, the rules for each language are absolutely necessary.
  • A guide to each language (link to). At the very least, I think the Playbook and Collector's how-to should be translated.
  • Filter out specific sentences. For example, characters in a foreign language. Number of characters. We need rules for each language.
  • This relates to the guidelines, but should also refer to "grey areas". For example, some arrangements should be made for sensitive sentences, such as sexual language, politics and religion.
    • For example, let's say there is a sentence stating "facts of history". But it's really quite common for what is considered "governing" in country A to be considered "aggression" in country B. We need to keep in mind that people from all backgrounds will read the sentence.
    • Sex is an important expression for us humans. But it's also a pretty difficult issue. Some sentences would be painful for some people to utter, such as 私はエッチです (I'm lecherous.) or 彼女は彼に身を任せた (she gave herself to him.) or おっぱい (boobs). It should also be noted that the terms and conditions allow minors to read them.
      • In fact, the Japanese source text says, 私よりもっとエッチな人もいて安心しました。 (I was relieved to see that some people were even more sexually active than me.) What do you guys think?
        • I admit that this is a poor translation of エッチ (etch). But when we say "エッチ", the Japanese definitely perceive something sexual in it. (And we read it aloud!)
  • Change the design of the review button. Why the thumb design? Yes, if the thumb is up, it's good; if it's down, it's no good. But it's hard to tell. There are two aspects to this.
    1. First of all, we don't have that gesture in Japan. This means that there are probably other countries and regions of the world that don't have a thumbs up or thumbs down culture. Like services like YouTube, I always feel like I'm in a "foreign country" when I see these regional expressions. Sometimes I wonder if Common Voice is really trying to engage with people around the world.
      • It is, in a word, universal design. We are trying to collect distinctive data. But ironically, Common Voice itself should not be "distinctive".
    2. Secondly, it's misleading. The only difference is exactly "thumbs". Why didn't they simply use "yes" or "no"?
      • The "yes" button and the "no" button are mentioned in the How to. This adds to the confusion. Or did the buttons use to be "yes" and "no" buttons? Even if they did, I'd like to see the description changed.
      • Another cause could be that the blocks of sentence are close together. Sometimes the thumbs are all lined up in a row and I can't tell which thumb is which. We should at least draw a border or alternate the color of the blocks.
  • Change the color of the review button. Yes, because as I mentioned above, it's confusing. Ideally, it should be clearly distinguishable, such as "yes" for green and "no" for red (yes, these are combinations that are difficult for colorblind people to see. They should be other combinations, and "approval" and "rejection" should be clearly distinguishable in letters and symbols alone). I think it's a good design that turns black when we press it.
  • Corpus links. Ref: We need a text corpus link
  • A page per sentence where we can see the metadata. (Is this not realistic because of the huge number of them?)
  • Show the exact number of unreviewed sentences.
  • The progress of total sentences and unreviewed sentences is visible, e.g., in colored bars.
  • "My added sentences" page. As in Rejected Sentences. This is handy when doing a self-review.
  • Discussion Button. Discuss with other users about sentences that the user cannot judge the review. (It might be better to create a page of the sentence only when this button is pressed. As I said before, the number of sentences is huge.)
    • As for one-sentence discussions, ideally they should be able to be done within the Collector tool. Discourse is too cumbersome with information. We need a page where we can comment on "specific sentences". Yes, that's what @irvin was suggesting in post #10.
  • Hide Button (if that's appropriate). I've mentioned this in We need a Q&A. It should also allow us to see only what we've hidden (i.e., a "Hidden sentences" page).
  • The ability to search for sentences.
  • The ability to check the source of a sentence.
  • The ability to search for the source of a sentence.
  • The ability to check the user who added or reviewed the sentence.
  • The ability to flag the sentence. For later review. A "hold". For example, when we want to review a difficult sentence after we've consulted a dictionary.
    • A hold period will be set up. When the period expires, the flag will be removed and other users will be able to review the sentence.
  • The ability to add sentences from a text file (.txt).
    • There should be rules for file formatting as well. Like one sentence per line.
    • The text file can be previewed before it is submitted.
    • We are notified of filtered sentences and can see what caused the filtering. (e.g. the offending letters turn red)
  • Dark Mode. Makes long hours of work easier.
    • Would it be faster to use Dark Reader or Midnight Lizard? (Sorry, I didn't try. Which means I don't want to put in an add-on!)
    • We may be willing to distribute user stylesheets. It is available in, for example, Stylus.
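The text file idea in the list above (one sentence per line, a preview before submitting, and a report of what caused a sentence to be filtered) could be sketched roughly like this. The two rules used here, a length limit and "no digits", are placeholders I picked for illustration; real rules would have to come from each language's rules file, and `preview_upload` is a hypothetical name.

```python
def preview_upload(text, max_chars=100):
    """Split an uploaded text into sentences and report why lines are filtered.

    Returns (accepted, filtered). Each filtered entry is a tuple of
    (line number, sentence, reason, character positions to highlight).
    """
    accepted, filtered = [], []
    for line_no, line in enumerate(text.splitlines(), start=1):
        sentence = line.strip()
        if not sentence:
            continue  # blank lines are ignored
        digit_positions = [i for i, ch in enumerate(sentence) if ch.isdigit()]
        if len(sentence) > max_chars:
            filtered.append((line_no, sentence, "too long", []))
        elif digit_positions:
            filtered.append((line_no, sentence, "contains digits", digit_positions))
        else:
            accepted.append(sentence)
    return accepted, filtered
```

The returned positions are what a preview page could use to turn the offending letters red.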

Random order

I sometimes look back and review the sentences I've "ignored", so a random order would be a bit inconvenient for me. But if we had the ability

  • to flag the sentences we care about, and
  • to search for sentences,

then we might try introducing it.

It's hard for me to notice mistakes when there are similar sentences in a row. So I generally agree that "boring" is the right word.

But, as @irvin says in post #14, we should not overlook the benefits of making connections (relationships) between each sentence.

Rejected sentences

Users should show everyone why they are rejecting the sentence.

For example,

  1. Press reject button.
  2. The choices are displayed.
    • Incorrect. (e.g., misspellings, lack of.)
    • Inappropriate language. (e.g., sexual language, hate speech, etc.)
    • It's hard to pronounce.
    • Can't understand the meaning.
    • Other (user enters)
  3. Select and reject.
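As a sketch of how such a choice could be recorded, the reasons above map naturally onto a fixed set of values plus a free-text comment for "Other". The names `RejectReason` and `Rejection` are hypothetical and only serve as an illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RejectReason(Enum):
    # Values mirror the choices listed above.
    INCORRECT = "incorrect"
    INAPPROPRIATE = "inappropriate language"
    HARD_TO_PRONOUNCE = "hard to pronounce"
    UNCLEAR_MEANING = "cannot understand the meaning"
    OTHER = "other"


@dataclass
class Rejection:
    sentence: str
    reason: RejectReason
    comment: Optional[str] = None  # free text, required for OTHER

    def __post_init__(self):
        # "Other (user enters)" only makes sense with an explanation.
        if self.reason is RejectReason.OTHER and not (self.comment or "").strip():
            raise ValueError("a comment is required when the reason is 'other'")
```

Storing the reason next to the sentence is also what would let a re-posted sentence show why it was rejected.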

I agree with the idea of re-posting after fix. But who is going to fix it? The user who added the sentence? Another user?
Wouldn't the work get done faster if the rejected sentences were public and could be fixed by any user?

In any case, for a re-posted sentence, users should be able to

  • see that it has been reposted.
  • see why it was rejected.

To maintain neutrality, it should be reviewable by anyone other than the user who rejected it.

Protesting the rejection

I also think the user who added the sentence needs to be able to protest (required if the user wants to re-post the sentence without fixing it). For example, let's say it was rejected because of a "misspelling". But it could be that the user who rejected it just didn't know the words or grammar. Therefore,

  1. The sentence is rejected.
  2. The user who added the sentence presses "protest button" (or simply "publish button" or "discuss button")
  3. A sentence discussion page is created.
  4. Each user gives their opinion on the page.

I think it's appropriate to maintain neutrality with this kind of process.

User filters

Hmmm, are these filters shared by all users? Or are they configurable on a per-user basis? Personally, I think it would be best to have both. I think the filters we share should be carefully considered.

It's important to be able to see clearly in the Collector tool which users have added and reviewed a sentence. But when reviewing a sentence, the user's information becomes noise. The "Review" screen should hide it, and the "Search" screen should show the metadata of the sentence (user, source, etc.). In this way, there would be a distinction between judgments about sentences and judgments about users.

Why do we want to use user filters? That's the focus. Most reasons to filter by user come from a problem with the sentence. Therefore,

  • Rejected sentence
  • Reposted sentence
  • Sentence with copyright issues

For the above, allow the users involved to be added to all the users' individual filters. Then, for the most problematic users, add them to a shared filter.

I would rather have a source filter. If there is an alleged copyright violation in a sentence, we can exclude it.

I think @irvin's opinion in post #3 is reasonable. Yes, even if we make the source searchable, not all users will submit a "source"... I agree with the filtering itself.


I think it's an option. I don't need it (because I'll check it myself).

For example,

  • Two hundred sentences were added today!
  • Twenty sentences need user comments.

The frequency of sending is also important. A day, a week, a month. Or every time a sentence is added? We might want to have an option to notify, "only rejected sentences" or "only sentences that require discussion".


Maybe we don't need a self-review when we upload, since we're able to do a self-review anyway. There would be no reason to interrupt the upload process. This should have been written in Sentence Collector - Review before Submit. Sorry.

Tell people from the platform

On the platform, let people know that the text is also being collected by volunteers. Perhaps people who only record voices and their validation don't know about it. Currently, when we run out of sentences to read, we are guided to the Collector tool. But I think sentence collection is a matter that should be mentioned by the platform. Because in fact, sentence collection is just as important as recording!
