Request: Number of not recorded sentences by language

jmontane · May 29, 2020, 10:49am

How many strings/sentences are pending for recording by language?
This info is useful for estimating when to search for or generate new texts and phrases before the recording queue is drained.

nukeador · May 29, 2020, 10:48am

Right now we don’t have an automated way to show this. What we usually do is estimate it from the current number of sentences and the current number of hours recorded, assuming an average of 5s per sentence.

Buffer of hours = (Number of current sentences * 5) / 3600
Hours left = Buffer of hours - Current recorded hours
Sentences left = (Hours left * 3600)/5

So for example, for Catalan there are 460481 sentences and 493 hrs recorded.

(460481 * 5) / 3600 = 639,55 hrs
639,55 - 493 = 146,55 hrs left
(146,55 * 3600)/5 = 105516 sentences left
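
The estimate above can be sketched as a small Python function. The function name and the `seconds_per_sentence` parameter are illustrative and not part of any Common Voice tooling; the 5 s average is the assumption stated above. Since the intermediate hours are not rounded here, the Catalan figures give 105521 rather than 105516.

```python
def estimate_sentences_left(total_sentences, recorded_hours, seconds_per_sentence=5.0):
    """Rough estimate of unrecorded sentences (hypothetical helper).

    Assumes every clip lasts `seconds_per_sentence` on average and that
    each sentence is recorded only once.
    """
    # Hours of audio the whole sentence pool would yield if read once
    buffer_hours = total_sentences * seconds_per_sentence / 3600
    # Hours still available before the pool is exhausted
    hours_left = buffer_hours - recorded_hours
    # Convert the remaining hours back into a sentence count
    sentences_left = hours_left * 3600 / seconds_per_sentence
    return buffer_hours, hours_left, sentences_left


# Catalan figures quoted above: 460481 sentences, 493 hrs recorded
buffer_hours, hours_left, sentences_left = estimate_sentences_left(460481, 493)
print(f"buffer: {buffer_hours:.2f} hrs, left: {hours_left:.2f} hrs, "
      f"sentences left: {round(sentences_left)}")
```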

hyxibg5lez · May 30, 2020, 6:08pm

How many times would a sentence be recorded?

nukeador · June 1, 2020, 12:05pm

Ideally it should be recorded just once, which has been shown to result in better quality in model training.

We don’t have that limitation built into the site yet, but that’s something we want to have at some point.

hyxibg5lez · June 1, 2020, 6:05pm

Thanks for the explanation.

As the system doesn’t enforce this limitation, would “Sentences left” in the above calculation be underestimated? Or does the “Current recorded hours” figure count only one clip per sentence (picked by whatever criteria when there are multiple clips)?

nukeador · June 2, 2020, 11:37am

Current recorded hours considers all clips.

If you do the math, you can also easily spot languages that have already recorded more hours than their available sentences would cover. In those cases the result signals how many repetitions there are (instead of sentences left).
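
As a rough sketch of that reading: when the estimated number of clips exceeds the number of sentences, the surplus is the minimum number of clips that must be repeats. The helper name and the example figures below are made up for illustration, and the 5 s average is the same assumption as before.

```python
def estimate_repetitions(total_sentences, recorded_hours, seconds_per_sentence=5.0):
    """Minimum number of repeated recordings implied by the hours (hypothetical helper)."""
    # All clips count towards recorded hours, so convert hours back to clips
    estimated_clips = recorded_hours * 3600 / seconds_per_sentence
    # Any clips beyond one per sentence must be repetitions
    return max(0.0, estimated_clips - total_sentences)


# Made-up figures, not taken from any real language
surplus = estimate_repetitions(10_000, 20)
print(f"at least {round(surplus)} repeated recordings")
```

This is only a lower bound: some sentences may have been recorded many times while others have not been recorded at all.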

hyxibg5lez · June 4, 2020, 3:32am

The CV UI shows the number of clips of a language recorded by all contributors:

Can we sum all the figures here to get an accurate number of clips, instead of estimating it from the recorded hours?

nukeador · June 4, 2020, 12:04pm

It’s actually the other way around: we estimate the number of hours from the number of clips we have. Improvements to the dashboard can be made, but they first need to be analyzed and prioritized.

If you have a specific problem/need, it would be interesting to explore it first and then look at possible solutions.

hyxibg5lez · June 4, 2020, 8:06pm

Sorry for the confusion. I am not asking to update the UI. I am figuring out a way to answer the original question:

How many strings/sentences are pending for recording by language?

You replied that we first need to know the number of sentences recorded, and that this can be estimated from the hours of clips. This is just an estimate, and the possibility of sentences being recorded multiple times makes it even less accurate. I am thinking another solution is to sum the figures shown in my screenshot, provided there is an API to retrieve those figures. No, I am not asking to update the UI. You tell us the API, if available, and we do the maths ourselves.

I am very interested in this question as I suspect my language (zh-HK) has already run out of sentences for recording.

jmontane · January 30, 2021, 8:22am

Common Voice has 3 roles. 2 roles are clear (speaker and listener), but one role is silent: sentence submitter.

Without sentences to record, there aren’t clips to review.

So, my original question is a request for an easy way to get the size of the sentence pool. It’s very important to always have sentences pending for recording.

daniel.abzakh · June 5, 2021, 8:27am

Where would I get the number of current recorded hours?