Large number of low-quality submissions in Sentence Collector makes reviewing boring

In the review queue for Swedish submissions in Sentence Collector, there are thousands of entries with the source “Project Gutenberg, with slight tweaks by me.” These seem to be based on texts with old-style grammar (such as plural verb forms that haven’t been used in Swedish since the first half of the last century, and word order that almost everyone would find odd nowadays) and old spelling, and they have not been tweaked nearly enough to be usable in Common Voice. There are also sentences with incorrect capitalization and spurious full stops in the middle of sentences.

It’s a real chore to go through and reject the majority of entries when they all suffer from similar problems. It used to be possible to skip ahead and review sentences further along in the queue, which I miss now. Is there some other way to approve sentences in Swedish without going through all the ones I’d rather skip?

Hi there,

At this point there is not, no. However, I think the main question here is: what does the ratio of good sentences vs. bad sentences look like? If the full data source does not provide much value and just contributes to frustration during review, it might be worth it to remove not-yet-reviewed sentences completely. What do you think?

I just want to agree here.

Even in cases where old sentences are pretty similar to modern spelling and wording, they still often use commas in ways we don’t nowadays. This isn’t really an issue per se I guess, but you can very often hear people reading these sentences awkwardly, making “pauses” in ways you never do in normal speech, all because of the (by modern standards) awkwardly placed comma. If these commas weren’t there, the recordings would probably have higher quality.

An example that works in English as well could be “He thought, that this was an awkward comma”. We used commas like this back in the 50s or so, but not anymore. People really stumble on commas like these.

I think my suggestion would then be to remove all remaining not-yet-reviewed sentences from that source. @ftyers would you agree?

Agree with @mkohler: the best thing to do is to remove the sentences from that source.

I have now deployed a version that deletes these sentences. Note that the deployment might take a while.

Has that been deployed by now? In the queue I still see 1000 similar sentences with the same source (“Project Gutenberg, with slight tweaks by me.”) and outdated (or incorrect) grammar. Are these new entries from the same user, or the old ones?

Mh, looks like it’s the old ones. I’ll have a look in the next few days.

For reference: https://commonvoice.mozilla.org/sentence-collector/sentences/sv-SE?source=Project%20Gutenberg,%20with%20slight%20tweaks%20by%20me.

