Let's talk about OpenAI Whisper and about collaborations with STT systems in general

When it comes to practically usable STT systems, the Common Voice datasets are mainly used by Vosk STT and Coqui STT (the spin-off company from the old DeepSpeech team). This is nice, but it looks like the big players are rarely using the datasets, or at least not in the way we thought they would.

The latest open-source system that doesn't use the Common Voice datasets for training is Whisper from OpenAI. They do use the datasets for QA and testing, but not for training (according to this post on GitHub). This is a little sad, since some of the biggest languages of Common Voice are not supported by the system, and we could give them a way to add them. Whisper is the only modern model from OpenAI that is still open source, and it is quite impressive from a technical viewpoint. It would be great if we could help to expand it over time.

Do you think there is a chance to start collaborations between Mozilla and Whisper, or one of the other STT systems? I believe it would be highly motivating for a language community if it would automatically get a usable STT model once it reaches, say, 1,000 hours.

In my experience, the people who organize donation events and collect sentences for their language are often not technically educated enough to train their own model. Some sort of assistance, or a collaboration with STT companies, would really improve this project. Right now, communities like Kinyarwanda, Esperanto or Belarusian have organized impressive campaigns for their size, but creating a usable STT system afterwards is slow and hard to achieve for a small community without much knowledge about machine learning.

Five years ago many people thought that a language would be integrated into the big systems from Google, Apple and so on as soon as enough data was available under a free license. Today this looks naive. Maybe we need official partnerships with STT systems to push this forward. I believe Mozilla and NVIDIA are big enough for an initiative like this, and it could really help small languages and also revive Common Voice.

What do you think?


It's possible to fine-tune Whisper models with Common Voice data so that they work better for your language. See https://huggingface.co/blog/fine-tune-whisper . I believe Mozilla and HuggingFace have a collaboration in place for including CV in the Datasets hub, e.g. https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0

We're using this dataset to fine-tune tiny, small and medium Whisper models for Welsh (+English), but have observed (so far) that WERs are not as good as those we get from wav2vec2. In addition, a KenLM language model can be used with wav2vec2 to reduce the WERs further. So Whisper might not be the best option. You can see our models here: https://huggingface.co/techiaith
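For anyone comparing models like this: word error rate (WER) is simply the word-level edit distance between the reference transcript and the model's hypothesis, divided by the number of reference words. A minimal self-contained sketch of the standard definition (not the specific tooling used for the Welsh models above):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[j] holds the edit distance between the processed prefix of ref
    # and the first j words of hyp (single rolling row of the DP table).
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev_diag, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            prev_diag, dp[j] = dp[j], min(
                dp[j] + 1,              # deletion of a reference word
                dp[j - 1] + 1,          # insertion of a hypothesis word
                prev_diag + (r != h),   # substitution (free if words match)
            )
    return dp[len(hyp)] / len(ref)

# Example: one substitution ("sit" for "sat") and one deleted "the"
# out of 6 reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
```

Libraries like `jiwer` or Hugging Face's `evaluate` compute the same metric in practice; the point here is just what the number means when two models are compared.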



Thanks for your links, I will read them and experiment with this.

I have tried to train models several times over the last few years, but without deeper knowledge, just following instructions, the results are often very poor; I failed several times despite spending many weekends trying. My best result for Esperanto was 25% WER using DeepSpeech. The Vosk model trained on the same data by someone competent is at 7% WER. With this model we are now able to create Android apps and progress further, but I wasn't able to do this myself.

There is no out-of-the-box solution where you pay for the server time, adapt the language code, define the alphabet and click on "start training". You always have to first learn about parameters, learn how to use Colab and Hugging Face, learn how countless libraries work, learn how to create a scorer, and much more.

I have some basic knowledge about scripting and I struggle with it. I don't think that the average linguist or activist who builds up a dataset for a minority language really has a chance to create a model, and making the model useful with an app afterwards is a completely different story.

In my opinion, giving more assistance here is fundamental for the success of this project. We need usable results, and by that I don't mean a demo on Hugging Face, but usable apps, voice keyboards, subtitle creators and voice assistants.

Maybe the new company https://mozilla.ai will improve the situation. I am really curious what they are up to.


Yes. Mozilla doesn't seem to be trying hard enough. But this dataset will be useful in the future; someday someone will make good use of it.

@maria2, the datasets are already useful. Many SotA models use CV in their production in one way or another. For example, Whisper uses it for evaluation, as their base train/dev data are large enough and the CV datasets have diverse voices (hopefully).

And I could reach ~5% WER by fine-tuning the multilingual Whisper medium model using only the CV Turkish v14.0 dataset (custom splits). It has about 100 hours of validated data.
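For context, "custom splits" here means re-splitting the validated data yourself rather than using the default train/dev/test TSVs. One common reason to do this is to keep each speaker in only one split, so the test WER measures generalization to unseen voices. A hedged sketch of such a speaker-disjoint split (the `client_id` and `path` field names follow the Common Voice TSV format; the exact algorithm used for the Turkish result above may differ):

```python
import random
from collections import defaultdict

def speaker_disjoint_split(clips, train=0.8, dev=0.1, seed=42):
    """Split Common Voice clip records into train/dev/test so that
    no speaker (client_id) appears in more than one split."""
    by_speaker = defaultdict(list)
    for clip in clips:
        by_speaker[clip["client_id"]].append(clip)

    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)  # deterministic shuffle

    n = len(speakers)
    n_train, n_dev = int(n * train), int(n * dev)
    buckets = {
        "train": speakers[:n_train],
        "dev": speakers[n_train:n_train + n_dev],
        "test": speakers[n_train + n_dev:],
    }
    # Expand each speaker bucket back into its clip records.
    return {
        split: [clip for spk in spks for clip in by_speaker[spk]]
        for split, spks in buckets.items()
    }

# Tiny illustration with fake records (real data comes from validated.tsv):
clips = [{"client_id": f"spk{i % 5}", "path": f"clip{i}.mp3"} for i in range(20)]
splits = speaker_disjoint_split(clips)
```

The trade-off is that the split ratios apply to speakers, not hours, so with few prolific speakers the actual audio proportions can drift from the targets.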


Really?? That’s very nice!


That's great! How did you do the fine-tuning? If you used Colab, Hugging Face or a similar cloud service, would you mind sharing your training notebook?

@stergro, Hugging Face, run locally. Hugging Face Datasets seems to make it easy if you use the defaults. On the other hand, if you diverge from what is provided (e.g. custom splits), it becomes harder and thus messy; it cannot be a simple notebook, I'm afraid.

My code is meant to support multiple languages, dataset sources, and splitting algorithms for benchmarking, and it became more messy. It will not work if the language is not among the original 99 languages Whisper was trained on; you need to write tokenizers and such for those, and don't expect good results.

It is not ready for any kind of open-source release, but I have it in a private GitHub repo; I can invite you there if you drop me a PM.


It seems to be over for many startups/projects.
Did you watch GPT-4o demos and/or reviews?