Recent Persian sample sentence submissions

zareami10 · March 10, 2022, 6:42am

I recently noticed a huge number of sentences (>200k) waiting in the review queue, presumably added from a dictionary for Persian.
Although that is generally not a bad thing, most of the words are obscure or quite rare, and there are a lot of spelling mistakes. The problem gets worse given that the existing sentences are already quite biased (colloquial sentences make up less than five percent, compared to (usually old) textbook and written Persian, which differ a lot in form). This huge review queue makes it impossible to add more diverse sentences on a small but frequent basis, considering the small contributor base.

What would a proper solution be?

mkohler · March 11, 2022, 6:22pm

This is in the Sentence Collector, right? If so, we can delete those if they do not provide value. Do they all have the same “source” displayed? If so, what is it? And are all the “sentences” from that source to be removed?

zareami10 · March 12, 2022, 4:45pm

Yes, they are from the Sentence Collector. I can’t say that they don’t provide any value, but at the current rate reviewing them all is almost impossible, and the quality is not high enough for bulk submission, so they might be doing more harm than good.

All the sentences I have been reviewing recently mention “self-prepared sentences”, which I suspect constitute a big portion of those 250k submissions considering they are all dictionary-like entries (but of course I can’t be sure since I cannot see all the sentences).

It would be nice if we could temporarily “hold” these submissions for later reviews and revisions (maybe by simply putting them in a separate directory that is not exported to CV?), but if that’s not possible I think removing them might be the only option at the moment, provided that this source is actually the cause of this huge queue.

mkohler · March 13, 2022, 3:21pm

Currently there are 237,663 sentences left to review. There are 79,269 with the source indicated as “self-prepared sentences”. These were uploaded in 12 different submissions (IDs just for my own future reference):

6d37f2c0-12dc-4370-a4db-a3063a8954b6
8b6d5a9a-6512-4f51-81ad-fb5ff312de9a
282c3350-78b6-4943-9d35-75953a9a4346
c14825f2-9c3d-4fe3-acae-5f8522cd3b03
c938ea72-74a7-43d3-85b1-f355ab526c84
adbf5670-da0c-4468-b108-c0bd3c3888dd
da9653d4-ba14-401f-8811-c7f6b62390b0
c8024715-2a3e-4550-859b-d8048ac358ed
6bb2b75c-887d-44b2-89d0-0788d13ad047
33deb172-d814-4ce3-ad35-b6f34e134322
60814880-3f6b-4605-9f40-faa9ee70b38f
4acb19ce-120a-43e3-be2a-c78d723526c2

So there must be more sources that recently got added. As I do not have access to the database, I can’t say which ones those are though.
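As a rough sanity check of the numbers above, here is a minimal sketch of the arithmetic. It only uses the two counts quoted in this post; the function name is made up for illustration, and nothing here queries the actual Sentence Collector database.

```python
# Counts quoted in this post (not fetched from any database).
TOTAL_PENDING = 237_663   # sentences left to review
SELF_PREPARED = 79_269    # source shown as "self-prepared sentences"


def remaining_from_other_sources(total: int, known: int) -> int:
    """Return how many pending sentences are not explained by the known source."""
    if known > total:
        raise ValueError("known source count exceeds the total")
    return total - known


other = remaining_from_other_sources(TOTAL_PENDING, SELF_PREPARED)
share = SELF_PREPARED / TOTAL_PENDING

print(f"from other sources: {other:,}")
print(f"self-prepared share of queue: {share:.1%}")
```

So roughly 158,394 pending sentences, about two thirds of the queue, would have to come from sources other than “self-prepared sentences”.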

> It would be nice if we could temporarily “hold” these submissions for later reviews and revisions (maybe by simply putting them in a separate directory that are not exported to CV?)

That is currently not possible. Something like a quarantine might be a good idea for the future (just might), but currently there is no flag for that and Sentence Collector has a single database. Though that’s just a technical limitation, identifying the actual sources and submissions that contain these sentences is way trickier.

zareami10 · March 15, 2022, 3:03pm

I see, thanks.

So we would have to delete these to be able to see the other (presumably huge) submissions?
