Are sentences for reading really chosen randomly?

I've been checking recordings in Scripted Speech mode, and in Russian there are sentences that get recorded, I'd guess, ten times more often than others. Some of them include:

  • Добровольчество — это не панацея. ("Volunteering is not a panacea.")
  • Не боясь расправы, студенты скандировали слово «свобода». ("Unafraid of reprisal, the students chanted the word 'freedom'.")
  • Масштабы финансово-экономического кризиса и темпы его распространения застали самых опытных специалистов мира врасплох. ("The scale of the financial and economic crisis and the speed of its spread caught the world's most experienced specialists off guard.")

Is this a bug, or were these among the first sentences in the Russian dataset, so there was nothing else to record in the beginning? I run into them so often that I literally know them by heart already XD

Well, true randomness is a real problem in computer science :slight_smile:

The life of a dataset can be very fruitful or very traumatic, and in some cases will result in such mishaps…

We need to analyze the whole life-line of the dataset to find out when and why this happened…

But consider this:

  1. A dataset begins with 5,000 sentences.
  2. 1,000 people come and record all sentences.
  3. Because of the low sentence count, each one gets recorded multiple times — and currently the maximum is 15 recordings per sentence.
  4. A year later, 100k sentences get added and are presented to users for recording; each one gets one recording — randomly…
  5. At release time, some 5,000 sentences have 10 recordings while others have a single one.
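The skew described in those steps can be reproduced with a tiny simulation. This is a hypothetical sketch, not the actual Common Voice logic — the cap of 15, the corpus sizes, and the draw counts are all taken from or assumed for the scenario above:

```python
import random
from collections import Counter

MAX_PER_SENTENCE = 15          # assumed per-sentence recording cap
counts = Counter()

# Phase 1: a small 5,000-sentence corpus saturated by many contributors.
old = [f"old-{i}" for i in range(5_000)]
for _ in range(50_000):
    s = random.choice(old)
    if counts[s] < MAX_PER_SENTENCE:
        counts[s] += 1           # record only if still under the cap

# Phase 2: 100,000 new sentences arrive; each gets roughly one recording.
new = [f"new-{i}" for i in range(100_000)]
for s in new:
    counts[s] += 1

old_avg = sum(counts[s] for s in old) / len(old)
new_avg = sum(counts[s] for s in new) / len(new)
print(f"old sentences average {old_avg:.1f} recordings, new average {new_avg:.1f}")
```

Even with perfectly uniform random selection at every step, the early sentences end up with an order of magnitude more recordings than the late arrivals — the skew comes from the corpus history, not from a broken random number generator.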

For SCS, everything starts with the text corpus; the quality of the dataset is mainly defined by it…

For about a year now, I think we have been providing nice randomness for both recording and validation. Data (20k sentences) gets cached for a while, and we select a random batch from it to present to the user for recording.
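The cache-then-sample scheme described above can be sketched roughly like this. All names and sizes here are illustrative assumptions (I don't know the real batch size or refresh policy), but the two-level structure — a periodically refreshed cache slice plus per-user random batches — is the idea:

```python
import random

CACHE_SIZE = 20_000   # assumed sentences pulled into the cache per refresh
BATCH_SIZE = 5        # assumed sentences shown to one user at a time

def refresh_cache(corpus):
    """Pull a random slice of the full corpus into a cache (stands in
    for the DB query whose result would be kept in Redis for a while)."""
    return random.sample(corpus, min(CACHE_SIZE, len(corpus)))

def next_batch(cache):
    """Pick a random batch from the cached slice for one user."""
    return random.sample(cache, BATCH_SIZE)

corpus = [f"sentence-{i}" for i in range(100_000)]
cache = refresh_cache(corpus)
batch = next_batch(cache)
print(len(batch))
```

The expensive random query runs once per cache refresh instead of once per user, at the cost that everyone served from the same cache draws from the same 20k-sentence pool until it is refreshed.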

You can get information about your datasets in these webapps — up to v22 (I haven't had time to update them since)…

I know, I just meant that it works well enough and doesn't show the same sentences too many times while recording :upside_down_face:

There are 14 sentences with >100 recordings in the Russian dataset, while most others have <10 recordings between v18 and v22. And there are 27 sentences with more than 100 recordings in v25 — but without the "single-word benchmark sentences" it would be 13-14 again, so maybe they are the cause.

In English, as far as I can see from https://analyzer.cv-toolbox.web.tr/, there is the same problem. There are 560 sentences with >100 recordings in v22 and 3,094 with [50–100) recordings, while >95% of sentences (1,086,526) have only one recording. That doesn't look like well-working randomness. But I'm not sure whether this was only in the past or is still the case now.

Anyway, thanks for the answer!

Those are the single-word sentences; every dataset from that era has them.

Yes, it got better over time. I wasn't around until 2021, but those SQL SELECTs were definitely changed over the years… It was also happening during validation. See:

There are several problems with this:

  • RAND() in SQL is very slow, especially when the data is hard to find (large corpus / few sentences left). That also makes Redis caching a necessity…
  • We cannot increase the cache time, as that would result in many people recording the same sentences
  • …etc…
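On the first point: `ORDER BY RAND()` effectively assigns a random key to every row and sorts the whole table before taking the head, so its cost grows with the corpus, not with the batch you actually want. Here is a rough, synthetic illustration of that cost difference in Python (not the actual server code — a plain in-memory list stands in for the table):

```python
import random
import timeit

rows = list(range(1_000_000))   # stand-in for a large sentence table

def order_by_rand(n=20_000):
    # what ORDER BY RAND() LIMIT n does conceptually:
    # key every row randomly, sort all of them, keep the head
    return sorted(rows, key=lambda _: random.random())[:n]

def sample_batch(n=20_000):
    # draw n rows directly, without touching the rest
    return random.sample(rows, n)

t_sort = timeit.timeit(order_by_rand, number=1)
t_sample = timeit.timeit(sample_batch, number=1)
print(f"full shuffle+sort: {t_sort:.3f}s, direct sample: {t_sample:.3f}s")
```

The gap only widens as the table grows, which is why a cached random slice (refreshed occasionally) beats re-running the random query for every request.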

So at some point you need to find a middle ground that covers the diverse nature of 300 different datasets. A while ago (8–9 months) I started to devise an adaptive caching algorithm, but got stuck with other projects before completing it.