We had a side conversation about this a while back in this topic.
- Common Voice recordings are limited to 10 seconds. This was based mainly on Mozilla’s DeepSpeech work from 5+ years ago, and on the VRAM limits of that time (8 GB).
- This limit also affects the text corpus (default 14 words / ~100-110 character sentences, which can be changed with locale-specific rules, but because of the recording limit it is not wise to increase them much). Common Voice mainly relies on CC-0/public domain sentences, so many sentences from public domain works got stripped out, or we had to do the tedious work of manually splitting sentences into sub-sentences to get the vocabulary needed.
- DeepSpeech is no more, and neither is its successor, Coqui STT. The current state of the art points towards Whisper or similar models (which can process larger chunks).
- I searched hard and could not find exact market figures for the latest graphics cards, and the existing ones are from a gamer’s perspective. But the current optimum for ML work in the wild is the RTX 3090/3090 Ti and RTX 4090 series, which have 24 GB of VRAM, or more dedicated cards with 40/48 GB or more.
- So, from the hardware perspective, capacity has at least doubled, more likely tripled. That would allow us to work on longer audio while keeping batch sizes the same.
- Whisper has a 30-second limit per inference, which is exactly 3x the current 10-second limit.
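As a back-of-the-envelope illustration of the reasoning above, here is a minimal sketch that assumes the usable clip length scales linearly with VRAM at a fixed batch size. That linearity is my simplifying assumption, not a measurement; real memory use depends on the model and training setup. The function name is hypothetical.

```python
# Rough estimate only: assumes clip length scales linearly with VRAM
# at a constant batch size (an assumption, not a measured result).
BASELINE_VRAM_GB = 8     # VRAM budget behind the original limit
BASELINE_CLIP_SEC = 10   # current Common Voice recording limit


def max_clip_seconds(vram_gb: float) -> float:
    """Scale the 10 s baseline by the ratio of available VRAM to 8 GB."""
    return BASELINE_CLIP_SEC * vram_gb / BASELINE_VRAM_GB


if __name__ == "__main__":
    for vram in (8, 16, 24):
        print(f"{vram} GB -> ~{max_clip_seconds(vram):.0f} s")
```

Under this assumption, 16 GB corresponds to about 20 seconds and 24 GB to about 30 seconds, which happens to match Whisper's window.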
So, I propose that we consider increasing this limit to 20 seconds (for 16 GB of VRAM), and I am opening it up for discussion.
Pros:
- More text-corpus data, possibly more vocabulary, and more domain-specific data (many languages already cover everyday speech, which tends to be short, 3-5 words).
- Possibly more volunteers / better volunteer retention, as they get more interesting sentences.
- A larger voice corpus
- Better data for state-of-the-art models
- Better models
Cons:
- My experience with volunteers shows that they are happier with shorter sentences, because they can move on to the next one sooner.
- Longer sentences can mean more errors, running out of breath, more re-recording of the same sentence, etc.
- That would increase the data size to be processed.
- That would need some changes in the code, both web/server and offline work.
- That would also affect user-side code; if the limit is hard-coded rather than parameterized, it can take some time to adapt.
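To illustrate what "parameterized" could mean for user-side code, here is a minimal sketch in which the limits live in one configuration object instead of being hard-coded. All names (`ClipLimits`, `is_clip_allowed`) are hypothetical and not part of any Common Voice codebase; the defaults are the current values mentioned above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClipLimits:
    """Hypothetical container for limits; defaults mirror the current rules."""
    max_duration_sec: float = 10.0  # current recording limit
    max_words: int = 14             # current default sentence length


def is_clip_allowed(duration_sec: float, limits: ClipLimits = ClipLimits()) -> bool:
    """Return True if the clip fits within the configured duration limit."""
    return 0 < duration_sec <= limits.max_duration_sec


if __name__ == "__main__":
    # With the current limit a 12 s clip is rejected...
    print(is_clip_allowed(12.0))
    # ...but raising the limit is a one-line configuration change.
    print(is_clip_allowed(12.0, ClipLimits(max_duration_sec=20.0)))
```

Code written this way would only need a changed default (or a new config value) if the limit moves to 20 seconds.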
So, what do you think?
Ref: @ftyers, @kathyreid, anybody.