Why did the dataset decrease?

A couple of days ago English had more than 900 validated hours and German around 400, but now English has 846 hours and German 357. What changed in the dataset? Were duplicated sentences removed, or was it something else?

I noticed this too but thought it was a bug because the overall total of all languages combined stayed the same.

Kabyle lost around 40 hours just like that. A big loss for a small language. What happened?


Good news: we didn’t lose any clips.

In the past, we calculated the hours shown on the homepage (and the languages page) from an average clip length. The average we used was based only on English and the sentences we had back then, which gave us 4.7s.
Now we’ve changed that to use averages per locale, based on the latest dataset release. A lot of languages have an average clip length lower than 4.7s, so the hour estimate went down for them.
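To make the effect concrete, here is a rough sketch of that kind of estimate. It is not the actual site code: the function name, the clip count, and the 4.2s per-locale average are made-up values for illustration, and only the 4.7s figure comes from the explanation above.

```python
SECONDS_PER_HOUR = 3600

# The old English-only average mentioned above.
OLD_GLOBAL_AVG_SECONDS = 4.7


def estimate_hours(clip_count: int, avg_clip_seconds: float) -> float:
    """Estimate total hours as clip count times average clip length."""
    return clip_count * avg_clip_seconds / SECONDS_PER_HOUR


# Hypothetical numbers for one locale; not real dataset figures.
clip_count = 360_000
per_locale_avg_seconds = 4.2

old_estimate = estimate_hours(clip_count, OLD_GLOBAL_AVG_SECONDS)
new_estimate = estimate_hours(clip_count, per_locale_avg_seconds)

print(f"old estimate: {old_estimate:.1f} h")
print(f"new estimate: {new_estimate:.1f} h")
print(f"difference:   {old_estimate - new_estimate:.1f} h")
```

The clip count is identical in both calls, so the drop in hours comes entirely from the smaller average, not from any clips being removed.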

We’re looking into saving the actual length of each clip in the database in the future, so that we no longer have to estimate.

Hope that all makes sense!


@gregor I initially thought that the duplicates were removed. Thanks for the clarification.