Skipped (hard) sentences are accumulating in German Common Voice. What can we do about it?

In the German version of Common Voice, the Wikipedia import happened in 2019 and the Europarl import in early 2020. Since then, hundreds of hours have been recorded. But over the last few months, the number of very hard-to-pronounce sentences seems to have increased a lot. Sentences with many foreign words or difficult names in particular come up all the time right now.

My theory is that over time most easy sentences have been recorded and only the hard ones (that have been skipped several times) are left. I believe this is a problem because it demotivates donors.

Here are a few examples collected in just a minute. They are all correct German sentences, but hard to pronounce for many donors:

Auch die Seychellen beanspruchten einen Teil der Îles Éparses vor dem France-Seychelles Maritime Boundary Agreement.

Alice wurde in „Bromont” in der Nähe von Newburg, Charles Company, Maryland, geboren.

Dieses Verständnis wurde Sabellianismus und modalistischer Monarchianismus genannt.

Do we need a feature that removes sentences that have been skipped too often? Should we remove these sentences manually? Should we simply import more sentences via the sentence collector? What do you guys think about this?


I think it’s a great idea to “retire” sentences that have been skipped too often. Perhaps a threshold of 10 skips or so? Deleting them is probably not a good idea, because someone may already have recorded them. The skip counts might also be useful information for a future classifier.

It’s also a good idea to add more sentences of course.
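
To make the "retire" idea a bit more concrete, here is a rough sketch in Python. It is not how Common Voice actually stores sentences; the `Sentence` class, its fields (`skip_count`, `clip_count`, `retired`) and the threshold of 10 are all assumptions for illustration. The point is only that a retired sentence gets flagged and hidden from donors, while the sentence, its recordings and its skip count stay in the data.

```python
from dataclasses import dataclass

# Assumed value, taken from the "10 skips or so" suggestion above.
SKIP_THRESHOLD = 10


@dataclass
class Sentence:
    """Hypothetical record; the real Common Voice schema may look different."""
    text: str
    skip_count: int = 0
    clip_count: int = 0
    retired: bool = False


def retire_hard_sentences(sentences, threshold=SKIP_THRESHOLD):
    """Flag sentences skipped at least `threshold` times.

    Nothing is deleted: retired sentences keep their clips and skip counts.
    Returns the sentences that should still be shown to donors.
    """
    for sentence in sentences:
        if sentence.skip_count >= threshold:
            sentence.retired = True
    return [s for s in sentences if not s.retired]


pool = [
    Sentence("Dieses Verständnis wurde Sabellianismus genannt.", skip_count=14, clip_count=1),
    Sentence("Das Wetter ist heute schön.", skip_count=0),
]
active = retire_hard_sentences(pool)
print([s.text for s in active])
```

Whether 10 is the right threshold would have to be checked against the real skip statistics.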

Yes, it’s a good idea. We also need a way to get the number of sentences that are still waiting to be recorded.
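
As a sketch of what such a count could look like, assuming each sentence is available as a record with a clip count and a "retired" flag (the field names `clip_count` and `retired` and the `clips_needed` parameter are made up here, not the actual Common Voice schema):

```python
def count_pending(sentences, clips_needed=1):
    """Count sentences that are not retired and still lack recordings.

    `sentences` is an iterable of dicts with the hypothetical keys
    "clip_count" and "retired"; `clips_needed` is an assumed target
    number of recordings per sentence.
    """
    return sum(
        1
        for s in sentences
        if not s.get("retired", False) and s.get("clip_count", 0) < clips_needed
    )


sentences = [
    {"text": "Das Wetter ist heute schön.", "clip_count": 0, "retired": False},
    {"text": "Ich gehe nach Hause.", "clip_count": 2, "retired": False},
    {"text": "Dieses Verständnis wurde Sabellianismus genannt.", "clip_count": 0, "retired": True},
]
print(count_pending(sentences))
```

If this number gets small while the skip counts go up, that would support the theory that only the hard sentences are left.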

