Discrepancy in recorded/validated hours

daniel.abzakh · January 27, 2022, 6:27am

There is a discrepancy in the number of recorded and validated hours for Abkhazian.

That is unfortunate. Instead of bringing the community the good news of the release, I now have to explain to them why we lost these hours!

Could you give us an explanation please? What happened?

What we had (latest was 106/74):

What we got:

stergro · January 27, 2022, 11:17am

I have seen similar things after earlier releases. Usually, this was because the average length of a recording had been recalculated. The numbers are not exact: they simply take the number of recordings and multiply it by the average length of a recording. If you add shorter sentences, or if people talk quickly, these numbers can change. Sentences from the Wikipedia extraction are typically very long, while manually collected sentences from other sources are typically a lot shorter.

But this is just a theory; it could also be an error.

bozden · January 27, 2022, 12:40pm

What @stergro says is most probable. The metadata is not out yet to look at, though… The graphs are estimates calculated from the last metadata: recordings x average_duration_in_last_metadata… It gets recalculated during dataset release…

Or bulk removal due to CC0 issues, if any, could be another cause…

daniel.abzakh · January 27, 2022, 1:24pm

Recalculating should give similar results because it should be the same formula. If there is a discrepancy, this means there is a bug somewhere. Bulk removal can’t be the issue in this case.

I need to hear from the Common Voice team; I can’t offer the community speculation. A lot of work is under way and people are volunteering their time, so every minute we record and validate counts.

@heyhillary @EM

heyhillary · January 27, 2022, 1:34pm

Hey Daniel,

Thank you for sharing this issue. Our team is currently investigating it, and we hope to respond as soon as possible. Please note that the leaderboard is an estimate of hours contributed.

I will follow up with you with a more comprehensive report regarding your query.

Sorry for any inconvenience caused.

heyhillary · February 2, 2022, 10:52am

Hey @daniel.abzakh,

Thanks so much for your patience.

Our team investigated your query and provided the following response.

The leaderboard on the website is a rough estimate of totalClips * averageClipDuration.

In the specific case of Abkhazian, the average clip duration went down between releases from 6.41s to 5.127s, which greatly affected the total hours calculation.
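To illustrate the formula above, here is a minimal Python sketch. Only the two average durations come from this thread; the clip count is a made-up figure and the function name is hypothetical, so this is not the actual Common Voice code.

```python
def estimated_hours(total_clips: int, avg_clip_duration_s: float) -> float:
    """Leaderboard-style estimate: totalClips * averageClipDuration, in hours."""
    return total_clips * avg_clip_duration_s / 3600.0


# Hypothetical clip count for illustration only, NOT the real Abkhazian figure.
total_clips = 50_000

old_estimate = estimated_hours(total_clips, 6.41)   # average duration before the release
new_estimate = estimated_hours(total_clips, 5.127)  # average duration after recalculation

# The relative change depends only on the two averages, not on the clip count.
relative_change = (new_estimate - old_estimate) / old_estimate
print(f"old: {old_estimate:.1f} h, new: {new_estimate:.1f} h, change: {relative_change:.1%}")
```

Because the clip count cancels out, the same two averages shrink the displayed total by roughly 20% for any number of clips.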

If you have any questions please feel free to ask.

daniel.abzakh · February 2, 2022, 1:09pm

Hey @heyhillary,

Thank you for your reply.

I can see how that happened: we did make sure the sentences are shorter in this release.

I think I will open an issue for this on the Common Voice GitHub. averageClipDuration should be updated more often to give realistic numbers.

We use these numbers in our campaign, and we also monitor events to see how far we have got, so the numbers shouldn’t be way off.

bozden · February 2, 2022, 3:52pm

Hi @daniel.abzakh, this is the method I’ve been using:

1. Calculate the average duration per character from the latest dataset metadata and text corpus.
2. When adding a new text corpus, try to mix long sentences with shorter ones, shuffle them randomly, and calculate the expected duration. I tried not to deviate too much from the current average.
3. If the two differ, keep in mind that this will result in an x% change from the figure shown.
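The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names and the sample data are made up, and the real metadata would have to be loaded from the release files first.

```python
def seconds_per_char(recorded):
    """recorded: list of (sentence, clip_duration_seconds) pairs from the last release."""
    total_chars = sum(len(sentence) for sentence, _ in recorded)
    total_seconds = sum(duration for _, duration in recorded)
    return total_seconds / total_chars


def expected_clip_duration(sentences, sec_per_char):
    """Predicted average clip length for sentences that are not recorded yet."""
    total_chars = sum(len(sentence) for sentence in sentences)
    return sec_per_char * total_chars / len(sentences)


# Made-up sample data, for illustration only.
recorded = [("abcd", 2.0), ("abcdef", 3.0)]
new_sentences = ["ab", "abcd"]

rate = seconds_per_char(recorded)                                   # seconds per character
current_avg = sum(duration for _, duration in recorded) / len(recorded)
expected_avg = expected_clip_duration(new_sentences, rate)

# Expected relative change of the average clip duration if only new_sentences get recorded.
change = (expected_avg - current_avg) / current_avg
print(f"{rate:.2f} s/char, expected average {expected_avg:.2f} s, change {change:.0%}")
```

If the predicted change is large, the new sentence batch can be rebalanced with longer or shorter sentences before it is submitted.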

Not ideal, but using this method I could keep the difference from the predicted duration under 1 hour (dropped from 65.x hours validated).

I think the real problem the CV engineers will be facing is the CPU resources and disk bandwidth required to calculate the real duration from mp3 files in bulk. I don’t know how they are doing it now, but they must be running on an offline copy…

I see they are moving to a release every 3 months; that might help.

Also, your dataset was rather small in v7.0. As it is larger now, any difference will have less effect on the total.

daniel.abzakh · February 2, 2022, 4:38pm

This is a great idea, I will look into it.

Maybe.

The Belarusian dataset is fairly large, but they probably have a similar issue, with a difference of 100 hours.

Andrej · February 3, 2022, 4:58pm

It’s true. The difference increases proportionally.

heyhillary · February 4, 2022, 2:07pm

Hey @daniel.abzakh and @bozden,

Thanks for flagging this issue. We understand the need for clearer public analytics for Common Voice. This is an ongoing project, but as a first step a team member has fixed the Grafana dashboard, so you can now see several key stats there, including new contributors, daily clips, top locales by hours and more.

The board pulls data from the platform rather than relying on estimates. I hope this helps with evaluation and with supporting community efforts.

If you have any questions please let me know.

bozden · February 4, 2022, 2:16pm

Oh my! That is awesome! Thank you!
