Hi @daniel.abzakh, this is the method I’ve been using:
- Calculate the average duration per character from the latest dataset metadata and the text corpus.
- When adding new text to the corpus, mix long sentences with shorter ones in random order, then calculate the expected duration. I tried not to deviate too much.
- If the two differ, keep in mind that this would result in an x% change from the figure shown.
Not ideal, but using this method I could keep the difference from the predicted duration under 1 hour (dropped from 65.x hours validated).
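In case it helps, here is a rough Python sketch of the calculation described above. The function names and all the numbers are made up for illustration; they are not taken from the Common Voice tooling or from any real release, so plug in your own values from the dataset metadata.

```python
import random


def average_char_duration(total_seconds: float, total_chars: int) -> float:
    """Seconds of audio per character, from existing dataset totals."""
    if total_chars <= 0:
        raise ValueError("total_chars must be positive")
    return total_seconds / total_chars


def estimate_duration_hours(sentences: list[str], sec_per_char: float) -> float:
    """Expected recording time in hours for a list of new sentences."""
    total_chars = sum(len(s) for s in sentences)
    return total_chars * sec_per_char / 3600.0


def percent_deviation(estimated: float, actual: float) -> float:
    """How far the estimate is from the real value, in percent."""
    if actual == 0:
        raise ValueError("actual must be non-zero")
    return (estimated - actual) / actual * 100.0


def shuffled(sentences: list[str], seed: int = 0) -> list[str]:
    """Return the sentences in random order so long and short ones are mixed."""
    mixed = list(sentences)
    random.Random(seed).shuffle(mixed)
    return mixed


if __name__ == "__main__":
    # Hypothetical totals: 10 hours of validated audio over 540,000 characters.
    sec_per_char = average_char_duration(10 * 3600, 540_000)

    # Hypothetical new sentences, mixed before being added.
    new_sentences = shuffled(["A short one.", "A noticeably longer sentence to read aloud."])
    print(f"{sec_per_char:.4f} s/char")
    print(f"{estimate_duration_hours(new_sentences, sec_per_char):.6f} h expected")
```

Once the real duration is known after a release, `percent_deviation` gives the x% mentioned in the last step.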
I think the real problem the CV engineers will face is the CPU resources and disk bandwidth required to calculate the real duration from mp3 files in bulk. I don’t know how they are doing it now, but they must be running on an offline copy…
I see they are moving to releases every 3 months, which might help.
Also, your dataset was rather small in v7.0. Now that it is larger, any difference will have less effect on the total.