Hello everyone
We would like to gather feedback on a situation that we have been monitoring for some time.
When Common Voice started, industry benchmarks suggested that around 5,000 hours of ASR training data might be required to train a robust STT model that could be deployed in a product like a voice assistant. Since then, multilingual foundation models and methodologies for fine-tuning them to new linguistic contexts have emerged, and many communities are choosing to build on these technologies rather than train from scratch.
In these cases, communities may only be aiming to generate small datasets, for example 50 hours. We are also receiving more language requests for Band A languages (small speaker populations, low-resource contexts, etc.). Against this backdrop, the current localisation burden of getting started on Common Voice (~1500 strings on Pontoon!) can seem particularly oversized.
We want to better support a range of data collection modalities. In light of this, we are considering overhauling our localisation requirements for enabling a new language on Common Voice.
Specifically, we would reduce the localisation requirements for go-live to only the text on the core contribute UIs (Speak, Listen, Write), the Download modal, and the legal documents (privacy, terms). This would not prevent communities from localising further, but it would mean they could start collecting data sooner.
We welcome your input on this proposal and look forward to hearing your feedback.
Thanks
CV Team