As you know, the recording duration limit has been increased from 10s to 15s. As indicated here, this was only the first step in the transition. Only the recording limit has been increased, but as we know, the recording duration is a function of sentence length (and of the volunteers' reading speed). So more steps are needed before we can actually get those longer recordings.
I’m opening this discussion to all communities so that we can think of some best practices we can adopt. But first, some info/reminders…
Sources of text-corpora
We currently have three sources for text-corpora:
- Web interface (write page)
- Bulk submissions through the write page (new process with additional steps)
- Wikipedia fair-use through cv-sentence-extractor (max 3 sentences per article)
Validation
- AFAIK, the first two are handled by rules set in the CV repo: https://github.com/common-voice/common-voice/tree/main/server/src/core/sentences/validation/languages
- The third has its own rule sets: https://github.com/common-voice/cv-sentence-extractor/tree/main/src/rules
In many cases there is a max_words limit, sometimes max_characters (e.g. for the it locale), and some languages might have more complex measures. These limits have usually been set by the communities by analyzing a subset of recordings for character_speed, the average number of milliseconds needed to speak each character, with some slack added for slow speakers…
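To make the relation between character_speed and the limits concrete, here is a minimal sketch of how a max_characters value could be derived. The function name, the slack parameter and the 100 ms/character figure are my own illustrative assumptions, not values taken from any community's rule set.

```python
import math


def max_characters(limit_seconds: float, ms_per_char: float, slack: float) -> int:
    """Longest sentence (in characters) expected to fit the recording limit.

    ms_per_char: measured character speed (hypothetical value in the example).
    slack: fraction of the limit reserved for slow speakers and pauses.
    """
    if ms_per_char <= 0:
        raise ValueError("ms_per_char must be positive")
    if not 0 <= slack < 1:
        raise ValueError("slack must be in [0, 1)")
    usable_ms = limit_seconds * 1000 * (1 - slack)
    return math.floor(usable_ms / ms_per_char)


# Illustrative only: 100 ms per character, a quarter of the limit kept as slack.
print(max_characters(10, ms_per_char=100, slack=0.25))  # old 10s limit
print(max_characters(15, ms_per_char=100, slack=0.25))  # new 15s limit
```

The same idea applies to max_words if you measure milliseconds per word instead.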
There are also minimums in these rules, mostly set to 1 word. With the new architectures it is advisable to have longer recordings (10-15 sec), although shorter ones are still very much valid, e.g. in conversations (like yes, no).
These rules have not changed yet, so recordings will not get longer unless volunteers read very slowly and/or take long pauses.
A Suggestion for Changes in Validation
A quick way of changing the sentence length would be to increase the max words/characters by 50%, in line with the increase from 10s to 15s. For example, the default maximum of 14 words would become 21…
The above step might work for many cases. But I find it a bit “quick-and-dirty” (or I’m a picky engineer). The reading speed might change with sentence length (among other things, like age). I expect many people will read slower and pause more to be able to read ahead, etc. Maybe a slightly lower value would be more adequate (e.g. 19-20 instead of 21).
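As a rough sketch of that arithmetic: scale the old maximum by the ratio of the limits, then shave off a small fraction for the expected slowdown on longer sentences. The slowdown parameter and its 5% value are assumptions of mine for illustration; nobody has measured it yet.

```python
import math


def scaled_max(old_max: int, old_limit_s: int = 10, new_limit_s: int = 15,
               slowdown: float = 0.0) -> int:
    """Scale a max_words/max_characters value to a new recording limit.

    slowdown is a hypothetical fraction by which reading is assumed to get
    slower on longer sentences (0.0 means plain proportional scaling).
    """
    if old_max < 1 or old_limit_s <= 0 or new_limit_s <= 0:
        raise ValueError("limits and old_max must be positive")
    if not 0 <= slowdown < 1:
        raise ValueError("slowdown must be in [0, 1)")
    proportional = old_max * new_limit_s / old_limit_s
    return max(1, math.floor(proportional * (1 - slowdown)))


print(scaled_max(14))                 # 21, the plain +50% rule
print(scaled_max(14, slowdown=0.05))  # 19, with an assumed 5% slowdown
```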
I think the optimal solution would be for communities to revisit/rethink their rules, re-sample the latest dataset for longer recordings/sentences, and calculate a more exact char_speed (or better, a distribution) to decide on these max values.
These are just my first thoughts on this topic and I want to hear your ideas. There are people here who are much more experienced than me.
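A possible shape for that analysis, as a sketch only: take (sentence, duration in ms) pairs sampled from the dataset, compute the per-character speed of each, and size the limit for a slow percentile rather than the average. How the pairs are extracted is left out, the 90th percentile is an arbitrary choice of mine, and note that clip durations usually include leading/trailing silence, which inflates the numbers somewhat.

```python
from statistics import quantiles


def ms_per_char(samples):
    """Per-recording character speed from (sentence, duration_ms) pairs."""
    return [duration_ms / len(sentence)
            for sentence, duration_ms in samples if sentence]


def slow_speaker_speed(speeds, percentile=90):
    """Speed (ms/char) at the given percentile; higher means slower readers."""
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # quantiles(n=100) returns 99 cut points; index p-1 is the p-th percentile.
    # It needs at least two data points.
    cuts = quantiles(speeds, n=100, method="inclusive")
    return cuts[percentile - 1]


def suggest_max_characters(samples, limit_seconds=15, percentile=90):
    """Max characters such that readers at the percentile still fit the limit."""
    speed = slow_speaker_speed(ms_per_char(samples), percentile)
    return int(limit_seconds * 1000 // speed)


# Made-up sample, only to show the call; real input would come from the dataset.
sample = [("x" * 50, 4000), ("x" * 80, 7200), ("x" * 100, 11000)]
print(suggest_max_characters(sample))
```

Looking at the whole list returned by ms_per_char (e.g. as a histogram, split by sentence length) would show whether speed really drops on longer sentences.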