Discussion: Best practices/steps for increased recording duration?

As you know, the recording duration limit has been increased from 10s to 15s. As indicated here, this was only the first step in the transition. Only the recording limit has changed, but as we know, the actual recording duration is a function of sentence length (and the reading speed of the volunteers). So more steps are needed before we can actually get those longer recordings.

I’m opening this discussion to all communities so that we can work out some best practices to adopt. But first, some info/reminders…

Sources of text-corpora

We currently have three sources for text-corpora:

  1. Web interface (write page)
  2. Bulk submissions through the write page (new process with additional steps)
  3. Wikipedia fair-use through cv-sentence-extractor (max 3 sentences per article)

Validation

  1. AFAIK, the first two are handled by rules set in the CV repo:
    https://github.com/common-voice/common-voice/tree/main/server/src/core/sentences/validation/languages
  2. The third has its own rule sets:
    https://github.com/common-voice/cv-sentence-extractor/tree/main/src/rules

In many cases there is a max_words limit, sometimes max_characters (e.g. for it), and some languages might have more complex measures. These limits have usually been set by the communities by analyzing a subset of recordings for character_speed (the average number of milliseconds needed to speak each character), while also leaving some slack for slow speakers…

There are also minimums in these rules, mostly set to 1 word. With the new architectures it is advisable to have longer recordings (10-15 sec), although shorter ones are still perfectly valid, e.g. in conversations (like yes, no).

These rules have not changed yet, so you won’t get longer sentences to record; recordings only get longer with very slow reading and/or long pauses.

A Suggestion for Changes in Validation

A quick way of changing the sentence length would be to increase the max words/characters by 50%, in line with the 10s to 15s change, e.g. the default maximum is 14 words, and it would become 21…

The above step might work in many cases, but I find it a bit “quick-and-dirty” (or I’m a picky engineer :slight_smile:). The reading speed might change with sentence length (among other things, like age). I expect many people will read more slowly, pause more often to be able to read ahead, etc. Maybe a slightly lower value would be more adequate (e.g. 19-20 instead of 21).
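To make the arithmetic concrete, here is a minimal sketch of the scaling idea. The function name and the `slowdown` parameter are my own illustration, not anything from the CV code base, and the 1.08 value is only an assumed example of "people read a bit slower on longer sentences", not a measured figure.

```python
import math


def scaled_max_words(old_max: int, old_limit_s: float, new_limit_s: float,
                     slowdown: float = 1.0) -> int:
    """Scale a word limit proportionally to the recording limit.

    slowdown > 1 is a hypothetical factor for readers being slower
    (more pauses, reading ahead) on longer sentences.
    """
    if slowdown <= 0:
        raise ValueError("slowdown must be positive")
    scaled = old_max * (new_limit_s / old_limit_s) / slowdown
    # Round down so the limit never overshoots the time budget.
    return math.floor(scaled)


quick_and_dirty = scaled_max_words(14, 10, 15)               # plain +50%
with_slowdown = scaled_max_words(14, 10, 15, slowdown=1.08)  # assumed 8% slower
```

With no slowdown this reproduces the 14 to 21 example; an assumed 8% slowdown lands at 19, i.e. in the 19-20 range mentioned above.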

I think the optimal solution would be for communities to revisit/rethink their rules, re-sample the latest dataset for longer recordings/sentences and calculate a more exact char_speed (or better a distribution) to decide on these max values.
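As a rough sketch of what "calculate a distribution and decide on the max value" could look like: given (sentence, duration in ms) pairs sampled from a dataset, compute the per-clip character speed, pick a slow-reader percentile, and derive a max_characters value for the 15s limit. The function names, the nearest-rank percentile and the 0.9 default are assumptions for illustration; each community would choose its own.

```python
import math
import statistics


def char_speeds_ms(samples):
    """Milliseconds per character for each (sentence, duration_ms) pair."""
    return [duration_ms / len(sentence) for sentence, duration_ms in samples if sentence]


def max_chars_for_limit(samples, limit_s=15.0, slow_quantile=0.9):
    """Return (max_chars, mean_speed_ms, slow_speed_ms).

    slow_speed_ms is the nearest-rank `slow_quantile` percentile of the
    per-clip speeds, so the limit leaves room for slower readers.
    """
    speeds = sorted(char_speeds_ms(samples))
    if not speeds:
        raise ValueError("no usable samples")
    rank = max(1, math.ceil(slow_quantile * len(speeds)))
    slow_speed = speeds[rank - 1]
    max_chars = int(limit_s * 1000 // slow_speed)
    return max_chars, statistics.mean(speeds), slow_speed


# Toy data, not real recordings: 80, 100, 120 and 100 ms/char.
toy = [("a" * 100, 8000), ("b" * 50, 5000), ("c" * 80, 9600), ("d" * 100, 10000)]
limit, mean_speed, slow_speed = max_chars_for_limit(toy)
```

Using a percentile instead of the mean is one way to build in the slack for slow speakers; a word-based limit would need the same analysis per word instead of per character.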

These are just my first thoughts on this topic and I want to hear your ideas. There are people here who are much more experienced than me.


This might help in this respect… I ran a round-up test for languages that are in both CV and Whisper, to measure the performance of Whisper’s multilingual model and get a baseline before fine-tuning.

  • I chose the (at most) 100 longest unique sentences from validated.tsv
  • I chose unique voices (client_ids) for the sentences
  • The sentence length is taken as is (no normalization, no removal of punctuation, etc.). Durations are taken from the clip_durations.tsv file (which includes silences at the start/end).
  • No demographic info is taken into account (e.g. age => slow?)
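The selection steps above can be sketched roughly as follows. This is not the script I actually ran, just a standard-library illustration. It assumes validated.tsv has client_id, path and sentence columns and that clip_durations.tsv has clip and duration[ms] columns; check this against your dataset version. The average here is total duration divided by total characters, which is one of several possible ways to average.

```python
import csv


def load_tsv(path):
    """Read a TSV file into a list of dicts (no quote handling)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE))


def pick_longest(validated_rows, durations_ms, limit=100):
    """Greedily pick up to `limit` longest unique sentences, one per voice."""
    seen_sentences, seen_voices, picked = set(), set(), []
    by_length = sorted(validated_rows, key=lambda r: len(r["sentence"]), reverse=True)
    for row in by_length:
        sentence, voice, clip = row["sentence"], row["client_id"], row["path"]
        if sentence in seen_sentences or voice in seen_voices or clip not in durations_ms:
            continue
        seen_sentences.add(sentence)
        seen_voices.add(voice)
        picked.append((sentence, durations_ms[clip]))
        if len(picked) == limit:
            break
    return picked


def avg_char_speed_ms(picked):
    """Total duration divided by total raw characters (punctuation included)."""
    total_chars = sum(len(sentence) for sentence, _ in picked)
    total_ms = sum(duration for _, duration in picked)
    return total_ms / total_chars


# Toy rows standing in for validated.tsv / clip_durations.tsv content.
rows = [
    {"client_id": "v1", "path": "a.mp3", "sentence": "x" * 30},
    {"client_id": "v1", "path": "b.mp3", "sentence": "y" * 20},  # same voice
    {"client_id": "v2", "path": "c.mp3", "sentence": "x" * 30},  # same sentence
    {"client_id": "v3", "path": "d.mp3", "sentence": "z" * 10},
]
durations = {"a.mp3": 3000, "b.mp3": 2000, "c.mp3": 3100, "d.mp3": 1000}
sample = pick_longest(rows, durations)
```

With real data, the durations dict could be built as `{r["clip"]: float(r["duration[ms]"]) for r in load_tsv("clip_durations.tsv")}`, again assuming those column names.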

Except for two locales, I could get 100 recordings to analyze. The following is part of that analysis, which includes the average character speed for those samples.

This is not ideal and might have outliers (e.g. the sr locale), but it might be a quick start…

locale	num_sentences	total_duration(sec)	avg_char_speed(msec/char)
am	15	93.8	117.54
ar	100	731.48	137.56
as	100	767.98	97.46
az	94	501.77	111.77
ba	100	831.14	104.03
be	100	935.51	80.91
bg	100	861.81	87.94
bn	100	933.37	67.78
br	100	450.68	140.28
ca	100	914.14	71.83
cs	100	667.27	90
cy	100	794.25	100.78
da	100	530.09	102
de	100	893.32	72.35
el	100	598.47	108.93
en	100	925.36	79.67
es	100	865.19	81.16
et	100	969.84	55.07
eu	100	780.02	94.15
fa	100	858.47	98.75
fi	100	658.88	100.55
fr	100	842.27	76.29
gl	100	754.5	87.4
ha	100	694.24	90.75
hi	100	723.11	119.01
hu	100	856.03	87.45
hy-AM	100	857.76	87.07
id	100	592.46	107.85
it	100	886.27	77.01
ka	100	858.7	85.81
kk	100	701.58	103.77
lt	100	721.92	99.74
lv	100	732.39	103.14
mk	100	583.72	91.78
ml	100	924.56	73.77
mn	100	763.89	86.97
mr	100	952.79	85.39
mt	100	621.23	107.26
nl	100	756.38	80.15
nn-NO	100	633.68	86.39
oc	100	691.97	103.55
pa-IN	100	673.98	127.01
pl	100	817.1	81.81
pt	100	841.74	94.48
ro	100	533.6	104.87
ru	100	863	73.75
sk	100	588.35	146.7
sl	100	499.21	137.03
sq	100	657.91	88.03
sr	100	381.33	257.5
sv-SE	100	722.74	96.05
sw	100	785.81	89.22
ta	100	843.61	86.77
th	100	870.65	101.15
tk	100	834.82	85.8
tr	100	741.63	93.22
tt	100	602.02	124.38
uk	100	832.73	97.17
ur	100	653.64	110.44
uz	100	824.75	89.96
vi	100	486.87	108.5
yo	100	712.99	88.05

In general, the top/bottom averages can differ by a factor of two or more, with an overall average of about 0.1 sec/char. This corresponds to 10 sec <=> 100 chars on average, without leaving any slack for slower speakers.
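To show how much the spread matters for a 15s limit, here is a small sketch that converts an average character speed into the number of characters that fit. The three speeds are taken from the table above; the function name and the `slack` parameter (fraction of the time budget held back for slower speakers) are assumptions for illustration.

```python
def chars_within(limit_s: float, char_speed_ms: float, slack: float = 0.0) -> int:
    """Characters that fit in limit_s at char_speed_ms, holding back `slack` of the time."""
    if not 0 <= slack < 1:
        raise ValueError("slack must be in [0, 1)")
    usable_ms = limit_s * 1000 * (1 - slack)
    return int(usable_ms // char_speed_ms)


# Average speeds (msec/char) copied from the table above.
table_speeds = {"et": 55.07, "en": 79.67, "sk": 146.7}
fits_in_15s = {lc: chars_within(15, speed) for lc, speed in table_speeds.items()}
```

At 0.1 sec/char this gives 100 chars for 10 sec and 150 for 15 sec, while the per-locale values range from about 100 to about 270 chars, which is why a single limit for all languages is unlikely to fit well.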
