The recorded hours of Uyghur language were reduced

neouyghur · February 5, 2025, 11:48pm

Hi,

This week, Uyghur’s recorded hours were reduced by 22 hours. Could anyone explain this change? Thanks.

bozden · February 6, 2025, 2:24pm

Hey Osman, a change happened in all datasets, because the figures in the languages page were only approximations.

It is caused by this PR.

Previously, to speed things up, a simple table of average seconds per recording per language was used, and if the language was newer, a global average came into play. Now it displays the (near-)correct value.

Previously, it was calculated from a global average in v9.0 (April 2022 - which had 93 datasets). The global average was:

const AVG_CLIP_SECONDS = 4.694

If your ACTUAL recording average is higher than this, you should see an increase
If your average is less, you will see a decrease.
If about the same => no recognizable change.
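To make the direction of the change concrete, here is a minimal sketch in JavaScript. Only `AVG_CLIP_SECONDS` comes from the post above; the function names, the clip count and the 4.0 s average are hypothetical, and this is not the site's actual code.

```javascript
// Global average quoted above (from v9.0).
const AVG_CLIP_SECONDS = 4.694;

// Hypothetical helper: total hours implied by a clip count and an average length.
function estimatedHours(clipCount, avgClipSeconds = AVG_CLIP_SECONDS) {
  return (clipCount * avgClipSeconds) / 3600;
}

// Hypothetical helper: compare the old table-based approximation with the
// figure obtained from the language's actual average clip length.
function compareEstimates(clipCount, actualAvgSeconds, tableAvgSeconds = AVG_CLIP_SECONDS) {
  const approxHours = estimatedHours(clipCount, tableAvgSeconds);
  const actualHours = estimatedHours(clipCount, actualAvgSeconds);
  return { approxHours, actualHours, deltaHours: actualHours - approxHours };
}

// Made-up example: 100000 clips whose real average (4.0 s) is below the
// assumed average, so deltaHours comes out negative (a decrease).
const example = compareEstimates(100000, 4.0);
console.log(example.deltaHours < 0 ? "decrease" : "increase or no change");
```

A language with its own entry in the old table would pass that entry as `tableAvgSeconds` instead of the global default.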

It is again “near-exact”, because the value is ROUNDED UP.

If the total recorded is < 10h, you will see a decimal. It has 6-minute precision, rounded up, e.g. 4h 02min will be shown as 4.1.
If the total recorded is >= 10h, no decimal point is shown, but the value is again rounded up, so 12h 01min will be shown as 13.
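The two rounding rules above can be sketched roughly like this. The function name `displayHours` is hypothetical, and so is the handling of values that round up to exactly 10 (I assume they switch to the no-decimal form); the real code may differ.

```javascript
// Hypothetical sketch of the display rounding described above.
// Input: total recorded duration in seconds. Output: the string shown.
function displayHours(totalSeconds) {
  // One tenth of an hour is 6 minutes = 360 seconds; always round up.
  const tenths = Math.ceil(totalSeconds / 360);
  if (tenths < 100) {
    // Below 10h: one decimal place, rounded up to the next 6 minutes.
    return (tenths / 10).toFixed(1);
  }
  // 10h and above: whole hours, rounded up.
  return String(Math.ceil(totalSeconds / 3600));
}

// The two examples from the post:
console.log(displayHours(4 * 3600 + 2 * 60));  // "4.1"  (4h 02min)
console.log(displayHours(12 * 3600 + 1 * 60)); // "13"   (12h 01min)
```

Working in whole seconds and dividing by 360 avoids rounding an already-rounded hour value.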

In your case, in v9.0 you had this average: ug: 6.031, but afterwards you probably added shorter sentences and the average dropped, so the page now shows more accurate data.

In our case (Turkish) we got a bump from 127h to 132h as our average is higher (about 4%).

Beware: the release data (cv-dataset repo = metadata) should show the correct data; the languages page was for display purposes. For releases, it should also be correct in my Metadata Viewer (where I show cv-dataset values) and Dataset Analyzer (where I re-calculate) webapps.

Hope this explains it…

neouyghur · February 7, 2025, 3:29am

Hi @bozden, thank you very much for your detailed explanation. As always, you are very helpful.