How many strings/sentences are pending for recording by language?
This info is useful for planning when to search for or generate more texts and phrases before the recording queue is drained.
Right now we don't have an automated way to show this. What we usually do is estimate it from the current number of sentences and the current number of hours recorded, assuming an average of 5 s per sentence.
Buffer of hours = (Number of current sentences * 5) / 3600
Hours left = Buffer of hours - Current recorded hours
Sentences left = (Hours left * 3600)/5
So for example, for Catalan there are 460481 sentences and 493 hrs recorded.
(460481 * 5) / 3600 = 639.55 hrs
639.55 - 493 = 146.55 hrs left
(146.55 * 3600) / 5 = 105516 sentences left
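The three steps above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in this post, not an official tool: the function name `estimate_sentences_left` and its parameters are made up for the example, and the 5-second average is the assumption stated above.

```python
SECONDS_PER_SENTENCE = 5  # average clip length assumed in the estimate above


def estimate_sentences_left(total_sentences, recorded_hours,
                            seconds_per_sentence=SECONDS_PER_SENTENCE):
    """Return (buffer_hours, hours_left, sentences_left) for one language."""
    # Hours of audio we would get if every sentence were recorded exactly once.
    buffer_hours = total_sentences * seconds_per_sentence / 3600
    # Hours still available before the sentence pool is exhausted.
    hours_left = buffer_hours - recorded_hours
    # Convert the remaining hours back into a sentence count.
    sentences_left = round(hours_left * 3600 / seconds_per_sentence)
    return buffer_hours, hours_left, sentences_left


# Catalan figures quoted above: 460481 sentences, 493 hours recorded.
buffer_hours, hours_left, sentences_left = estimate_sentences_left(460481, 493)
print(f"buffer: {buffer_hours:.2f} h, left: {hours_left:.2f} h, "
      f"sentences left: {sentences_left}")
```

Because the script does not round the intermediate hours, it gives 105521 sentences left rather than the 105516 obtained by hand above. The difference is only rounding; both are rough estimates.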
How many times would a sentence be recorded?
Ideally it should be recorded just once, which has been shown to result in better quality for model training.
We don't have that limitation built into the site yet, but that's something we want to have at some point.
Thanks for the explanation.
As the system doesn't enforce this limitation, would "Sentences left" in the above calculation be underestimated? Or does the figure "Current recorded hours" consider only one clip per sentence (chosen among multiple clips by some criterion)?
Current recorded hours considers all clips.
Doing the math, you can also easily spot languages that have already recorded more hours than their available sentences can account for. In that case the result signals how many repetitions there are (instead of sentences left).
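As a rough sketch of that check, again assuming 5 seconds per sentence: the function name `estimate_repeated_clips` is hypothetical, and the sample figures (10000 sentences, 20 hours) are invented for illustration and do not describe any real language.

```python
def estimate_repeated_clips(total_sentences, recorded_hours,
                            seconds_per_sentence=5):
    """Estimate how many clips must be repeats of an already recorded sentence."""
    # Approximate number of clips implied by the recorded hours.
    estimated_clips = recorded_hours * 3600 / seconds_per_sentence
    # Any clips beyond the size of the sentence pool have to be repetitions.
    surplus = estimated_clips - total_sentences
    return max(0, round(surplus))


# Invented figures: 10000 sentences but 20 hours already recorded.
repeats = estimate_repeated_clips(10000, 20)
print(repeats)

# Catalan figures from earlier in the thread: the pool is not exhausted yet.
catalan_repeats = estimate_repeated_clips(460481, 493)
print(catalan_repeats)
```

This is only a lower bound: since the site does not enforce one recording per sentence, some sentences can be recorded several times before the pool runs out.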
The CV UI shows the number of clips of a language recorded by all contributors:
Can we sum all the figures here to get an accurate number of clips, instead of estimating it from the recorded hours?
It's quite the other way around: we estimate the number of hours from the number of clips we have. Improvements to the dashboard can be made, but they first need to be analyzed and prioritized.
If you have a specific problem/need, it would be interesting to explore it first and then look at possible solutions.
Sorry for the confusion. I am not asking to update the UI. I am figuring out a way to answer the original question:
How many strings/sentences are pending for recording by language?
You replied that we first need to know the number of sentences recorded, and that this can be estimated from the hours of clips. This is just an estimate, and the possibility of sentences being recorded multiple times makes it even less accurate. I think another solution is to sum the figures shown in my screenshot, provided that we have an API to retrieve those figures. No, I am not asking to update the UI. Tell us the API, if available, and we will do the maths ourselves.
I am very interested in this question as I suspect my language (zh-HK) has already run out of sentences for recording.
Common Voice has 3 roles. 2 roles are clear (speaker and listener), but one role is silent: sentence submitter.
Without sentences to record, there aren't clips to review.
So my original question is a request for an easy way to get the size of the sentence pool. It's very important to keep sentences pending for recording.
Where would I get the number of current recorded hours?