Discrepancy in recorded/validated hours

@heyhillary
Hello Hillary,

There is a discrepancy in the number of recorded and validated hours for Abkhazian.

That is unfortunate: instead of bringing the community the good news of the release, I now have to explain to them why we lost these hours!

Could you give us an explanation please? What happened?

What we had (latest was 106/74):

What we got:

1 Like

I have seen similar things after other releases. Usually, this was because the average length of a recording was recalculated. The numbers are not completely accurate: they simply take the number of recordings and multiply it by the average length of a recording. If you add shorter sentences or people talk quickly, these numbers can change. Sentences from the Wikipedia extraction are typically very long, while manually collected sentences from other sources are typically a lot shorter.

But this is just a theory, it also could be an error.
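To make the theory concrete, here is a minimal sketch of how such an estimate behaves. The function name and all numbers are made up for illustration; this is not the actual Common Voice code, only the "recordings times average length" idea described above.

```python
def estimated_hours(clip_count: int, avg_clip_seconds: float) -> float:
    """Estimate total hours as clip count times average clip length."""
    return clip_count * avg_clip_seconds / 3600.0


# Hypothetical language: the clip count stays the same, but the
# average clip length is recalculated and comes out shorter.
clips = 50_000
old_avg_seconds = 6.0  # assumed average before recalculation
new_avg_seconds = 5.0  # assumed average after recalculation

before = estimated_hours(clips, old_avg_seconds)
after = estimated_hours(clips, new_avg_seconds)

print(f"before: {before:.1f} h, after: {after:.1f} h")
print(f"apparent loss: {before - after:.1f} h, with no clips removed")
```

The point is only that the displayed hours can fall without a single recording being lost.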

2 Likes

What @stergro says is most probable. The metadata is not out yet to look at, though… The graphs are estimates calculated from the last metadata: recordings x average_duration_in_last_metadata… It gets recalculated during the dataset release…

Or bulk removal due to CC0 issues, if any, can be another source…

2 Likes

Recalculating should give similar results because it should be the same formula. If there is a discrepancy, this means there is a bug somewhere. Bulk removal can’t be the issue in this case.

I need to hear from the Common Voice team; I can’t give speculation to the community. A lot of work is under way and people are volunteering their time, so every minute we record and validate counts.

@heyhillary @EM

Hey Daniel,

Thank you for sharing this issue. Our team is currently investigating it, and we hope to respond as soon as possible. Please note that the leaderboard is an estimate of hours contributed.

I will follow up with you for a more comprehensive report regarding your query.

Sorry for any inconvenience caused.

2 Likes

Hey @daniel.abzakh,

Thanks so much for your patience.

Our team investigated your query and provided the following response.

The leaderboard on the website is a rough estimate: totalClips * averageClipDuration.

In the specific case for Abkhazian, the average clip duration went down over releases from 6.41s to 5.127s which greatly affected the total hours calculation.
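As a rough illustration of the size of that effect, the sketch below rescales an hours figure by the ratio of the two averages. It assumes the clip count is unchanged and that the earlier 106/74 figures were computed with the 6.41 s average, which is not confirmed in this thread; the function name is hypothetical.

```python
OLD_AVG_SECONDS = 6.41   # average clip duration before, from the reply above
NEW_AVG_SECONDS = 5.127  # average clip duration after, from the reply above


def rescale_hours(hours: float, old_avg: float, new_avg: float) -> float:
    """Rescale an hours estimate when only the average clip duration changes."""
    return hours * new_avg / old_avg


ratio = NEW_AVG_SECONDS / OLD_AVG_SECONDS
print(f"new estimate is {ratio:.1%} of the old one")

# Assumption: the earlier figures (106 recorded / 74 validated) used 6.41 s.
for label, hours in (("recorded", 106.0), ("validated", 74.0)):
    rescaled = rescale_hours(hours, OLD_AVG_SECONDS, NEW_AVG_SECONDS)
    print(f"{label}: {hours:.0f} h -> {rescaled:.1f} h")
```

Under those assumptions the estimate shrinks by about a fifth from the change in average duration alone.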

If you have any questions please feel free to ask.

2 Likes

Hey @heyhillary,

Thank you for your reply.

I can see that is what happened; we did make sure the sentences are shorter in this release.

I think I will open an issue for this on the Common Voice GitHub. averageClipDuration should be updated more often to give realistic numbers.

We use these numbers in our campaign, and we also monitor events to see how far we have come, so the numbers shouldn’t be way off.

3 Likes

Hi @daniel.abzakh, this is the method I’ve been using:

  • Calculate the average duration per character from the latest dataset metadata and text corpus.
  • When adding new text corpus, try to mix long sentences with shorter ones in random order, and calculate the expected duration. I tried not to deviate too much.
  • If the expected and current averages differ, keep in mind that this will result in an x% change from the hours shown.

Not ideal, but using this method I could keep the difference from the predicted duration under 1 hour (dropped from 65.x hours validated).
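The steps above could be sketched roughly like this. The data layout (a list of sentence and duration pairs) and the function names are assumptions for illustration, not the real metadata format, and the sample data is made up.

```python
def seconds_per_char(clips: list[tuple[str, float]]) -> float:
    """Average duration per character over (sentence, duration_seconds) pairs."""
    total_chars = sum(len(sentence) for sentence, _ in clips)
    total_seconds = sum(duration for _, duration in clips)
    return total_seconds / total_chars


def predicted_avg_clip_seconds(sentences: list[str], sec_per_char: float) -> float:
    """Expected average clip duration for a batch of new sentences."""
    avg_chars = sum(len(s) for s in sentences) / len(sentences)
    return avg_chars * sec_per_char


# Hypothetical existing clips: (sentence, measured duration in seconds).
existing = [("a" * 10, 1.0), ("a" * 20, 2.0)]
# Hypothetical new sentences to be added to the text corpus.
new_sentences = ["a" * 5, "a" * 15]

rate = seconds_per_char(existing)
current_avg = sum(d for _, d in existing) / len(existing)
expected_avg = predicted_avg_clip_seconds(new_sentences, rate)

# Percentage change to expect in the shown hours per clip recorded.
change_pct = (expected_avg - current_avg) / current_avg * 100
print(f"expected change per clip: {change_pct:+.1f}%")
```

If the change is large, mixing in longer sentences before uploading brings the expected average back toward the current one.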

I think the real problem the CV engineers will face is the CPU resources and disk bandwidth required to calculate the real duration from mp3 files in bulk. I don’t know how they are doing it now, but they must be running on an offline copy…

I see they are moving to three-monthly releases, which might help.

Also, your dataset was rather small in v7.0. As it is larger now, any difference will have less effect on the total.

4 Likes

This is a great idea, I will look into it.

Maybe.

The Belarusian dataset is fairly large, but they probably have a similar issue, with a difference of 100 hours.

3 Likes

It’s true. The difference increases proportionally.

1 Like

Hey @daniel.abzakh and @bozden,

Thanks for flagging this issue. We understand the need for clearer public analytics for Common Voice. This is an ongoing project, but as a first step a team member has fixed the Grafana dashboard, where you can now see several key stats, including new contributors, daily clips, top locales by hours and more.

The board pulls data from the platform rather than relying on estimates. I hope this helps with evaluation and with supporting community efforts.

If you have any questions please let me know.

3 Likes

Oh my! That is awesome! Thank you!

1 Like