I rented a server and am training models on demand for the rest of the month. I’ve done Chuvash (cv) and Tatar (tt) so far: https://tepozcatl.omnilingo.cc/
Here is a list of the rest of the language codes of the languages I plan to train: et, ta, tr, ky, dv, mn, id, br, mt, el, th, rm-sursilv, ro, hu, ia, sl, lv, lg, or, sah, cnh, ga-IE, ja, lt, rm-vallader, ka, hsb, pa-IN, vi, fi, ab, as, hi, vot.
If you would like a trained model for a language that is in this list, find me in the Common Voice Matrix channel and let me know, and I’ll bump it up the priority list.
I think they should do it after every language with more than one hour! They take hardly any time to train (about 30 minutes on an old GPU). The stats are online here: https://tepozcatl.omnilingo.cc/manifest.html
An interesting thing about this graph is that you can see orthographic effects.
As regards WER, the systems are not really usable at the moment, but they could easily be fine-tuned for, e.g., closed-vocabulary tasks (I provide the checkpoints and the alphabet). What is striking is that for some languages the CER gets down to around 20%, which means they could potentially be useful in applications like indexing or audio search. There are probably a lot of useful things that can be done with less-than-perfect ASR…
The following models are due to be trained: rm-sursilv, ia, ja, vi, rm-vallader.
This is very neat and impressive data, nice work! Could you clarify the difference between the two CER/WER columns? I’m just getting into this sort of stuff and am not sure how to interpret them.
The first column is just the acoustic model; the second column is the acoustic model with an external KenLM-based scorer. That’s actually explained on the page, I just couldn’t fit it in the screenshot.
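In case it helps with interpreting the columns: WER and CER are just Levenshtein edit distance computed at the word and character level respectively, divided by the length of the reference. A minimal sketch in Python (my own illustration, not the code the training pipeline actually uses):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row DP)."""
    n = len(hyp)
    dp = list(range(n + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def wer(reference, hypothesis):
    """Word error rate: word-level edits / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)
```

So a CER of 20 on the stats page means roughly one character in five is wrong, which is why audio search can still work even when the word-level output looks bad.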
Are you interested in any language in particular? I am happy to give more details.
I would be interested to hear what type of impact the differences between PT-BR and PT-PT have. I’m not a speaker but I’m under the impression that one can sometimes be difficult to understand for speakers of the other. Do you know whether the training clips are primarily one or the other, or whether there is a difference in results between them?
I imagine that the training clips are split approximately according to population ratio, so roughly 9 Brazilian clips for every 1 non-Brazilian clip. That’s my impression after listening to 10 clips on the Common Voice site too.
A speaker of Brazilian Portuguese tested the system and said it is “better than expected” but that “a lot of heavy lifting is done by the language model”. No non-Brazilian speaker has tested it.
I find it difficult to understand Portuguese from Portugal, but then it accounts for far less than 10% of my exposure to Portuguese, probably more like 1%.
If you are interested in variants and the relative performance of the models, then Irish and Finnish would also be quite interesting: Irish in terms of dialectal differences, and Finnish in terms of the difference between the colloquial and written language.
Very interesting! Good suggestions regarding Irish and Finnish; I had no idea that a difference of the sort you describe existed in Finnish. I will certainly investigate that.
The Abkhazian language has about an hour of recordings; not much, but it would be interesting to train an STT model and check its quality.
I’m working on machine translation for Abkhazian; one of the techniques I’m using is back translation, for which a monolingual corpus is very useful.
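For anyone unfamiliar, back translation can be sketched like this: a reverse (target-to-source) model turns monolingual target-language text into synthetic parallel data. The `reverse_model` below is a hypothetical stand-in for a real trained translator, just to show the data flow:

```python
def back_translate(target_monolingual, reverse_model, forward_pairs):
    """Back translation: use a target->source model to generate
    synthetic source sentences, then add the (synthetic source,
    real target) pairs to the forward model's training data."""
    synthetic = [(reverse_model(t), t) for t in target_monolingual]
    return forward_pairs + synthetic

# Toy stand-in for a trained reverse translator (hypothetical).
reverse_model = lambda sentence: sentence[::-1]
augmented = back_translate(["аҧсуа бызшәа"], reverse_model, [])
```

The key point is that the target side of each synthetic pair is real, fluent text, which is what the forward model learns to produce.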
Would it be useful for you to have audio without text, or text without audio? I have about 20 hours of audio, and a lot of text.
Also, do you do some sort of word tokenization? That pushes the scores up for machine translation, and I’m assuming it should do the same for STT.
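By tokenization I mean something minimal like the following rough sketch (a regex-based illustration of my own, not what any particular toolkit does), so that punctuation doesn’t stick to words and the language model sees consistent units:

```python
import re

def tokenize(text):
    """Lowercase the text and split punctuation off from words,
    so 'world!' and 'world' count as the same token."""
    text = text.lower()
    # A run of word characters, or any single non-space symbol.
    return re.findall(r"\w+|[^\w\s]", text)
```

For example, `tokenize("Hello, world!")` gives `['hello', ',', 'world', '!']` instead of two punctuation-glued tokens.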
@ftyers Francis, thank you for doing this. I’m a complete noob in this area, so I need to ask you some questions (I think these are non-technical and will not be in your report):
What should the CER/WER values be for a practical application?
What are the reasons for higher error rates? Model or data?
How can we make these values better (besides the obvious ones like adding more sentences/recordings)?