When it comes to practically usable STT systems, the Common Voice datasets are mainly used by Vosk STT and Coqui STT (the spin-off company from the old DeepSpeech team). This is nice, but it looks like the big players are rarely using the datasets, or at least not in the way we thought they would.
The latest open source system that doesn’t use Common Voice datasets for training is Whisper from OpenAI. They do use the datasets for QA and testing, but not for training (according to this post on GitHub). This is a little sad, since some of the biggest languages of Common Voice are not supported by the system, and we could give them a way to add them. Whisper is the only modern model from OpenAI that is still open source, and it is quite impressive from a technical viewpoint. It would be great if we could help to expand it over time.
Do you think there is a chance to start collaborations between Mozilla and Whisper or one of the other STT systems? I believe it would be highly motivating for a language community if it automatically got a usable STT model once it reached, say, 1000 hours.
In my experience, the people who organize donation events and collect sentences for their language often lack the technical background to train their own model. Some sort of assistance or a collaboration with STT companies would really improve this project. Right now, many communities such as Kinyarwanda, Esperanto or Belarusian have organized campaigns that are impressive for their size, but creating a usable STT system afterward is slow and hard to achieve for a small community without much knowledge of machine learning.
Five years ago, many people thought that a language would be integrated into the big systems from Google, Apple and so on the moment enough data was available under a free license. Today this looks naive. Maybe we need official partnerships with STT systems to push this forward. I believe Mozilla and NVIDIA are big enough for an initiative like this, and it could really help small languages and also revive Common Voice.
What do you think?