Hi,
This week, Uyghur’s recorded hours were reduced by 22 hours. Could anyone explain this change? Thanks.
Hey Osman, the change happened in all datasets, because the figures on the languages page were only approximations.
It is caused by this PR.
Previously, to speed things up, a simple table of average seconds per recording per language was used, and if the language was newer, a global average came into play. Now the page displays the (near-)correct value.
Those averages were calculated from v9.0 (April 2022, which had 93 datasets). The global average was:
const AVG_CLIP_SECONDS = 4.694
The new value is again only “near-exact”, because it is ROUNDED UP.
In your case, in v9.0 you had this average:
ug: 6.031,
but afterwards you probably added shorter sentences and the average dropped, so the page now shows more accurate data.
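To make the difference concrete, here is a minimal sketch of the two approaches. It is not the actual code from the PR: the names AVG_CLIP_SECONDS_BY_LOCALE, estimateHours and hoursFromTotalSeconds are made up for illustration, and I am assuming the rounding is up to whole hours. Only the two averages (4.694 and ug: 6.031) come from the values quoted above.

```typescript
// Global average quoted above (from v9.0).
const AVG_CLIP_SECONDS = 4.694;

// Hypothetical per-language table; only the entry quoted in this thread is filled in.
const AVG_CLIP_SECONDS_BY_LOCALE: Record<string, number> = {
  ug: 6.031,
};

// Old approach (sketch): clip count times a fixed average duration,
// falling back to the global average for languages missing from the table.
function estimateHours(locale: string, clipCount: number): number {
  const avgSeconds = AVG_CLIP_SECONDS_BY_LOCALE[locale] ?? AVG_CLIP_SECONDS;
  return (clipCount * avgSeconds) / 3600;
}

// New approach (sketch): start from the real total duration and round up.
// Rounding up to whole hours is an assumption on my side.
function hoursFromTotalSeconds(totalSeconds: number): number {
  return Math.ceil(totalSeconds / 3600);
}
```

So if the real average clip length of a language falls below its v9.0 table value, the old estimate overstates the hours, and switching to the real duration makes the displayed number drop.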
In our case (Turkish) we got a bump from 127h to 132h as our average is higher (about 4%).
Beware: the release data (cv-dataset repo = metadata) should show the correct values; the languages page was only for display purposes. For releases, the values should also be correct in my Metadata Viewer (where I show cv-dataset values) and Dataset Analyzer (where I re-calculate) webapps.
Hope this explains it…
Hi @bozden, thank you very much for your detailed explanation. As always, you are very helpful.