Discussion: Relaxation of the 10 sec recording limitation

We had a side conversation about this a while back in this topic.

Problem definition:

  • Common Voice recordings are limited to 10 seconds. This limit was mainly inherited from Mozilla’s DeepSpeech work, which started 5+ years earlier, and was driven largely by the VRAM limits of that era (8 GB).
  • This limit also constrains the text-corpus (by default 14 words / ~100-110 characters per sentence, which can be changed with locale-specific rules - but because of the recording limit it is not wise to raise them much; see the sketch after this list). Common Voice relies mainly on CC-0/public-domain sentences, so many sentences from public-domain works got stripped out, or we had to do the tedious work of manually splitting long sentences into sub-sentences to keep the vocabulary we needed.
  • DeepSpeech is no more, and neither is its successor Coqui STT. The current state of the art points towards Whisper or similar models, which can process longer chunks.
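
For illustration only, here is a minimal sketch of what a relaxed, locale-specific length rule could look like - the thresholds and the function are hypothetical, not Common Voice’s actual validation code:

```python
# Hypothetical sentence-length validation, loosely modelled on the idea of
# locale-specific rules. The numbers are illustrative only.
DEFAULT_RULE = {"max_words": 14, "max_chars": 110}   # roughly a 10 sec reading
RELAXED_RULE = {"max_words": 28, "max_chars": 220}   # roughly a 20 sec reading

def sentence_fits(sentence: str, rule: dict = DEFAULT_RULE) -> bool:
    """Return True if the sentence fits the recording-length budget."""
    words = sentence.split()
    return len(words) <= rule["max_words"] and len(sentence) <= rule["max_chars"]

if __name__ == "__main__":
    s = ("This public-domain sentence would currently be rejected because "
         "it is simply too long for a ten second recording window.")
    print(sentence_fits(s, DEFAULT_RULE))  # False under the current limit
    print(sentence_fits(s, RELAXED_RULE))  # True under a relaxed 20 sec limit
```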

Current Status:

  • I searched but could not find exact market figures for the latest graphics cards, and the ones that exist are from a gamer’s perspective. Still, the current sweet spot for ML work in the wild is the RTX 3090/3090 Ti and RTX 4090 series with 24 GB of VRAM, or more dedicated cards with 40/48 GB or more.
  • So, from the HW perspective, capacity has at least doubled, more likely tripled. That would allow us to work on longer audio while keeping batch sizes the same.
  • Whisper has a 30-sec limit per inference window, which is exactly 3x the current 10-sec limit.

So, I propose thinking about increasing this limit to 20 sec (workable on 16 GB VRAM) and opening it up for discussion.

Upsides:

  • More text-corpus data, possibly more vocabulary and more domain-specific data (many languages have already covered everyday speech, which tends to be short, 3-5 words).
  • Possibly more volunteers / better volunteer retention as they get more interesting sentences.
  • More voice-corpus data
  • Better data for state-of-the-art models
  • Better models

Downsides:

  • My experience with volunteers shows that they are happier with shorter sentences - they move on to the next one more quickly.
  • Longer sentences can mean more errors, running out of breath, more re-recording of the same sentence, etc.
  • That would increase the data size to be processed.
  • That would need some changes in the code, both web/server and offline work.
  • That would also affect user-side code; if it is not written parametrically, it can take some time to adapt.
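
On that last point: the adaptation cost is small wherever the limit already lives in a single parameter rather than being hard-coded. A minimal, hypothetical sketch of what I mean by “parametric”:

```python
# Hypothetical user-side preprocessing: keep the clip-duration limit as a
# single parameter so a future change (10 -> 20 sec) is a one-line edit.
MAX_CLIP_SECONDS = 10.0  # would become 20.0 if the proposal is adopted

def keep_clip(duration_sec: float, max_seconds: float = MAX_CLIP_SECONDS) -> bool:
    """Filter applied when building a training manifest from CV clips."""
    return 0.0 < duration_sec <= max_seconds
```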

So, what do you think?
Ref: @ftyers, @kathyreid, anybody.

I strongly agree with this proposal, for the following reasons:

  • A 20s time limit will allow for more “natural”-sounding speech. Although it will still be read speech, in contrast to spontaneous speech, it will allow for greater intonational variance and variety.

  • A 20s time limit allows for more flexibility in written sentence generation, and for more variation in sentence and grammatical structure.

I would like more information on:

  • Is there any impact on forced aligners from longer audio lengths?
  • Have we tested the compute impact?
  • How do common ASR algorithms limit or constrain audio length? DeepSpeech is all but abandoned, but how do, say, Coqui STT or the NVIDIA models constrain audio length?
  • Will this impact some languages more than others? I wonder if it would particularly benefit agglutinative languages, which can have very long word constructions. I’m thinking of, say, German or Kiswahili, where you end up with very long words in sentences. I’m not strong enough in linguistics to know which groups of languages are more agglutinative, but I bet @Francis_Tyers knows :laughing:. So, there might be an argument here on the basis of equity and diversity - if under-represented or marginalised languages are better served by a longer audio length.

Thank you for adding more insight into the topic, @kathyreid. Important points indeed.

Coqui STT

Unfortunately, it was also abandoned last month. Therefore I moved on to other possibilities, initially Whisper for now… With DeepSpeech (i.e. Coqui STT), I could work with batch sizes of 128/256 in most cases on 16 GB VRAM; if it fails, you just drop the batch size. So, when you increase the duration, you should roughly halve the batch size on the same GPU. I never worked with clips longer than 10 sec, but I know of people working with 20-25 sec at batch sizes of around 32, which is mostly enough - it just takes more time (up to 30% more in my Coqui STT experiments). I did not see important changes in model accuracy in these ranges in my Coqui STT experiments.
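
As a back-of-the-envelope only - assuming memory use grows roughly linearly with clip length, which matches my experience but is not exact - the scaling looks like this:

```python
# Rough rule of thumb from the numbers above: for a fixed GPU, the usable
# batch size scales roughly inversely with the maximum clip duration.
def scaled_batch_size(base_batch: int, base_sec: float, new_sec: float) -> int:
    """Estimate a safe batch size when moving to longer clips (heuristic only)."""
    return max(1, int(base_batch * base_sec / new_sec))

print(scaled_batch_size(128, 10.0, 20.0))  # -> 64 on the same 16 GB card
print(scaled_batch_size(128, 10.0, 25.0))  # -> 51, near the ~32 used in practice
```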

There is always the possibility of dividing the audio of course.
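
A minimal sketch of that splitting idea, assuming pydub (and ffmpeg) is available; real splitting should of course cut on silences or sentence boundaries rather than at fixed offsets:

```python
from pydub import AudioSegment  # assumes pydub + ffmpeg are installed

def split_clip(path: str, max_seconds: float = 20.0) -> int:
    """Naive fixed-length splitting of a long clip into <= max_seconds chunks.
    A sketch only: production splitting should cut on silences instead."""
    audio = AudioSegment.from_file(path)
    step_ms = int(max_seconds * 1000)
    chunks = [audio[i:i + step_ms] for i in range(0, len(audio), step_ms)]
    for n, chunk in enumerate(chunks):
        chunk.export(f"{path}.part{n}.mp3", format="mp3")
    return len(chunks)
```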

AFAIK, the recommended audio duration for NVIDIA’s Jasper/QuartzNet is 5-25 sec. It is not a hard limit, but I read that going beyond 20 sec starts to “hit the GPU memory”. For Whisper, it is 30 sec. I’m new to Whisper, but I read about some problems during inference with longer audio, where it tries to enforce the sentence patterns from the training set, so people tend to divide the audio into shorter chunks - but that is inference. I do not have enough knowledge about fine-tuning yet; I will in a couple of weeks - “more work needed :slight_smile:”.
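
For reference, a minimal long-audio inference sketch with the openai-whisper package (assuming it is installed; the file name and language code are just examples):

```python
import whisper  # openai-whisper package

model = whisper.load_model("base")

# transcribe() already slides a 30-sec window over longer audio internally;
# people who hit the repetition/hallucination issues mentioned above instead
# split the file themselves (e.g. with a splitter like the sketch above) and
# transcribe each part separately.
result = model.transcribe("long_recording.mp3", language="tr")
print(result["text"])
```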

We could try custom-splitting a single dataset for experiments where the total training audio is the same but the sentence lengths fall into different ranges. But there are many other factors that would affect such an experiment. I thought about it before, but left it aside for this reason.
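
If someone does want to try it, here is a hypothetical sketch of such a split - equal total audio per duration bin, so only sentence length varies:

```python
import random

def split_by_duration(clips, bins=((0, 5), (5, 10), (10, 20)), hours_per_bin=10.0):
    """clips: list of (clip_id, duration_sec) pairs.
    Returns one subset per duration bin, each capped at the same total audio
    budget, so the subsets can be compared as training sets."""
    clips = list(clips)
    random.shuffle(clips)
    budget = hours_per_bin * 3600.0
    subsets = {b: [] for b in bins}
    totals = {b: 0.0 for b in bins}
    for clip_id, dur in clips:
        for b in bins:
            if b[0] <= dur < b[1] and totals[b] + dur <= budget:
                subsets[b].append(clip_id)
                totals[b] += dur
                break
    return subsets
```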

More info in these areas is needed of course. Anybody with knowledge, please chime in.

agglutinative languages

Turkish is one of them, nearly the most extreme, where we cannot use classic n-grams in LMs for a general-purpose application - the model easily becomes huge.

But, as we know, it is mainly the total training audio duration that matters, not just the number of sentences. So, instead of 10,000 sentences averaging 5 sec, getting longer ones that push the average to 10 sec will give much better results. Marginalized languages could therefore collect a text-corpus more easily and build better models, as there would be more audio per sentence - I think.

Another problem with long audio with respect to training is the duration distribution. If most of your audio is around 5 sec, there is a single 15 sec clip, and you use large batch sizes tuned for 5-6 sec, at some point it will crash the training. This has been the reason for the hard limit. It will be an annoyance at the start, but it will even out over time. I can think of several methods to ease that, e.g. excluding the longer clips as we do now, or training with shorter clips (10 sec) and fine-tuning with longer ones, etc.
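
Another common option, which I did not list but would fit under “etc.”: bucket the clips by duration so that long clips get their own, smaller batches. A rough sketch, not any particular toolkit’s sampler:

```python
def duration_buckets(manifest, bucket_edges=(10.0, 15.0, 20.0),
                     batch_sizes=(128, 64, 32)):
    """Group (clip_id, duration_sec) pairs into duration buckets, each with its
    own batch size, so one long clip cannot blow up a large-batch step.
    A sketch only; real toolkits usually ship a bucketing sampler."""
    buckets = [[] for _ in bucket_edges]
    for clip_id, dur in manifest:
        for i, edge in enumerate(bucket_edges):
            if dur <= edge:
                buckets[i].append(clip_id)
                break
        # clips longer than the last edge are simply dropped, as we do today
    batches = []
    for ids, bs in zip(buckets, batch_sizes):
        batches += [ids[i:i + bs] for i in range(0, len(ids), bs)]
    return batches
```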

In any case, my main concern is UX - i.e. people getting tired of reading long sentences.


FWIW, could the next ‘evolution’ of Common Voice be a collaboration or overlap with the LibriVox project? That could solve the problem of longer clips whilst providing interesting sentences to volunteers.