Are sentences for reading really chosen randomly?

I've been checking recordings in Scripted Speech mode, and in Russian there are sentences that get recorded, I'd guess, ten times more often than others. Some of them include:

  • Добровольчество — это не панацея. ("Volunteering is not a panacea.")
  • Не боясь расправы, студенты скандировали слово «свобода». ("Unafraid of reprisal, the students chanted the word 'freedom'.")
  • Масштабы финансово-экономического кризиса и темпы его распространения застали самых опытных специалистов мира врасплох. ("The scale of the financial and economic crisis and the speed of its spread caught the world's most experienced specialists off guard.")

Is this a bug, or were these among the first sentences in the Russian dataset, so there was nothing else to record in the beginning? I run into them so often that I literally know them by heart already XD

Well, true randomness is a real problem in computer science :slight_smile:

The life of a dataset can be very fruitful or very traumatic, and in some cases will result in such mishaps…

We need to analyze the whole life-line of the dataset to find out when and why this happened…

But consider this:

  1. A dataset begins with 5,000 sentences.
  2. 1,000 people come and record all sentences.
  3. Because of the low sentence count, each one gets recorded multiple times — and currently the maximum is 15 recordings per sentence.
  4. A year later, 100k sentences get added and are presented to users for recording; each one gets one recording — randomly…
  5. At release time, some 5,000 sentences have 10 recordings while others have a single one.
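The skew described in those steps can be reproduced with a tiny simulation. This is a hypothetical sketch, not the actual Common Voice logic — the cap of 15, the corpus sizes, and the draw counts are all taken from or assumed for the scenario above:

```python
import random
from collections import Counter

MAX_PER_SENTENCE = 15          # assumed per-sentence recording cap
counts = Counter()

# Phase 1: a small 5,000-sentence corpus saturated by many contributors.
old = [f"old-{i}" for i in range(5_000)]
for _ in range(50_000):
    s = random.choice(old)
    if counts[s] < MAX_PER_SENTENCE:
        counts[s] += 1           # record only if still under the cap

# Phase 2: 100,000 new sentences arrive; each gets roughly one recording.
new = [f"new-{i}" for i in range(100_000)]
for s in new:
    counts[s] += 1

old_avg = sum(counts[s] for s in old) / len(old)
new_avg = sum(counts[s] for s in new) / len(new)
print(f"old sentences average {old_avg:.1f} recordings, new average {new_avg:.1f}")
```

Even with perfectly uniform random selection at every step, the early sentences end up with an order of magnitude more recordings than the late arrivals — the skew comes from the corpus history, not from a broken random number generator.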

For SCS, everything starts with the text corpus; the quality of the dataset is mainly defined by it…

For about a year now, I think we have been providing nice randomness for both recording and validation. Data (20k sentences) gets cached for a while, and we select a random batch from it to present to the user for recording.
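The cache-then-sample scheme described above can be sketched roughly like this. All names and sizes here are illustrative assumptions (I don't know the real batch size or refresh policy), but the two-level structure — a periodically refreshed cache slice plus per-user random batches — is the idea:

```python
import random

CACHE_SIZE = 20_000   # assumed sentences pulled into the cache per refresh
BATCH_SIZE = 5        # assumed sentences shown to one user at a time

def refresh_cache(corpus):
    """Pull a random slice of the full corpus into a cache (stands in
    for the DB query whose result would be kept in Redis for a while)."""
    return random.sample(corpus, min(CACHE_SIZE, len(corpus)))

def next_batch(cache):
    """Pick a random batch from the cached slice for one user."""
    return random.sample(cache, BATCH_SIZE)

corpus = [f"sentence-{i}" for i in range(100_000)]
cache = refresh_cache(corpus)
batch = next_batch(cache)
print(len(batch))
```

The expensive random query runs once per cache refresh instead of once per user, at the cost that everyone served from the same cache draws from the same 20k-sentence pool until it is refreshed.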

You can get information about your datasets in these webapps — up to v22 (I haven't had time to update them since)…

I know, I just meant that it works well enough and doesn't show the same sentences too many times while recording :upside_down_face:

There are 14 sentences with >100 recordings in the Russian dataset, while most others have <10 recordings between v18 and v22. And there are 27 sentences with more than 100 recordings in v25 — but without the "single-word benchmark sentences" it would be 13-14 again, so maybe they are the cause.

In English, as far as I can see from https://analyzer.cv-toolbox.web.tr/, there is the same problem. There are 560 sentences with >100 recordings in v22 and 3,094 with [50–100) recordings, while >95% of sentences (1,086,526) have only one recording. That doesn't look like well-working randomness. But I'm not sure whether this was only in the past or is still the case now.

Anyway, thanks for the answer!

Those are the single-word sentences; every dataset from that era has them.

Yes, it got better over time. I wasn't around until 2021, but those SQL SELECTs were definitely changed over the years… It was also happening during validation. See:

There are several problems with this:

  • RAND() in SQL is very slow, especially when the data is hard to find (large corpus / few sentences left). That also makes Redis caching a necessity…
  • We cannot increase the cache time, as that would result in many people recording the same sentences
  • …etc…
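On the first point: `ORDER BY RAND()` effectively assigns a random key to every row and sorts the whole table before taking the head, so its cost grows with the corpus, not with the batch you actually want. Here is a rough, synthetic illustration of that cost difference in Python (not the actual server code — a plain in-memory list stands in for the table):

```python
import random
import timeit

rows = list(range(1_000_000))   # stand-in for a large sentence table

def order_by_rand(n=20_000):
    # what ORDER BY RAND() LIMIT n does conceptually:
    # key every row randomly, sort all of them, keep the head
    return sorted(rows, key=lambda _: random.random())[:n]

def sample_batch(n=20_000):
    # draw n rows directly, without touching the rest
    return random.sample(rows, n)

t_sort = timeit.timeit(order_by_rand, number=1)
t_sample = timeit.timeit(sample_batch, number=1)
print(f"full shuffle+sort: {t_sort:.3f}s, direct sample: {t_sample:.3f}s")
```

The gap only widens as the table grows, which is why a cached random slice (refreshed occasionally) beats re-running the random query for every request.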

So at some point you need to find a middle ground that covers the diverse nature of 300 different datasets. A while ago (8–9 months) I started to devise an adaptive caching algorithm, but got stuck with other projects before completing it.