Model training on demand

Hi everyone,

I rented a server and am training models on demand for the rest of the month. I’ve done Chuvash (cv) and Tatar (tt) so far.

Here are the language codes for the rest of the languages I plan to train: et, ta, tr, ky, dv, mn, id, br, mt, el, th, rm-sursilv, ro, hu, ia, sl, lv, lg, or, sah, cnh, ga-IE, ja, lt, rm-vallader, ka, hsb, pa-IN, vi, fi, ab, as, hi, vot.

If you would like a trained model for a language on this list, find me on Matrix in #Common Voice and let me know, and I’ll bump it up the priority list. :slight_smile:


The following models are now available: pt, et, ta, tr, ky, dv, mn, br, mt, lg, sah, cnh, ga-IE, ka, hsb, fi.

The following models could unfortunately not be trained due to lack of data: ab, as, hi, vot.

The following models are due to be trained: id, el, th, rm-sursilv, ro, hu, ia, sl, lv, or, ja, ka, pa-IN, vi.

Great initiative, thanks for your work! Mozilla should do this after every dataset release for every language with more than 300 hours or so.

Would you mind sharing the statistics a little? Like the WER for every model.

I think they should do it for every language with more than one hour! The models take hardly any time to train (about 30 minutes on an old GPU). The stats are online here:

An interesting thing about this graph is that you can see orthographic effects :slight_smile:

As regards WER, the systems are not really usable at the moment, but they could easily be fine-tuned for e.g. closed-vocabulary tasks (I provide the checkpoints and the alphabet). What is striking is that for some languages the CER approaches 20, which means they could potentially be useful in applications like indexing or audio search. There are probably a lot of useful things that can be done with less-than-perfect ASR…
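For anyone new to the metrics: WER and CER are both the Levenshtein edit distance between the reference and the hypothesis, normalised by the reference length, computed over words and characters respectively. A minimal plain-Python sketch (the example strings are made up, not from the actual test sets):

```python
def edit_distance(ref, hyp):
    # Classic dynamic-programming Levenshtein distance over a
    # sequence of characters or words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)]

def cer(ref, hyp):
    # Character error rate: edits per reference character, in percent.
    return 100 * edit_distance(list(ref), list(hyp)) / len(ref)

def wer(ref, hyp):
    # Word error rate: edits per reference word, in percent.
    return 100 * edit_distance(ref.split(), hyp.split()) / len(ref.split())
```

A CER of 20 means roughly one character in five is wrong, which can still leave many words recognisable for search-style applications.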

The following models are due to be trained: rm-sursilv, ia, ja, vi, rm-vallader.


If you plot the CER against seconds per character in the alphabet, then the effect is more clear:
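To make the x-axis concrete: “seconds per character in the alphabet” divides the amount of training audio by the alphabet size, so a language with a large alphabet needs proportionally more audio to give each symbol the same coverage. A tiny sketch with invented numbers (the real figures are on the results page):

```python
def seconds_per_character(audio_seconds, alphabet_size):
    # Average amount of training audio "available" to each character
    # of the alphabet; the x-axis of the CER plot.
    return audio_seconds / alphabet_size

# Hypothetical per-language stats: (validated audio in seconds, alphabet size).
stats = {
    "tt": (28 * 3600, 36),
    "cv": (4 * 3600, 37),
}

for lang, (seconds, alphabet) in stats.items():
    print(lang, round(seconds_per_character(seconds, alphabet), 1))
```

This normalisation is one way the orthographic effect shows up: two languages with the same hours of audio can sit at very different points on this axis.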


This is awesome, I wanna try some models as soon as possible.


You can download them right now from here. :slight_smile: The Portuguese model is one of the better ones. Not ready for use maybe, but ready for further fine-tuning.

I have uploaded a new one for Portuguese that is even better:

I’m still working on it, but will update here.

Here are the final results:

And some preliminary results for the target segments corpus:


This is very neat and impressive data, nice work! Could you clarify the difference between the two CER/WER columns? I’m just getting into this sort of thing and am not sure how to interpret that.


The first column is just the acoustic model; the second column is the acoustic model with an external KenLM-based scorer. That’s actually on the page here, I just couldn’t fit it in the screenshot :smiley:
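To illustrate what the scorer does: during decoding, each candidate transcript’s acoustic score is combined with a language-model score and a word-count bonus, which can re-rank the candidates. A toy sketch in plain Python; the weights and scores below are invented for illustration, not actual KenLM values:

```python
def rescore(candidates, alpha=0.93, beta=1.18):
    # candidates: list of (transcript, acoustic_logprob, lm_logprob).
    # DeepSpeech-style combination: acoustic + alpha * LM + beta * word count.
    # alpha/beta here are illustrative hyperparameters, normally tuned per language.
    scored = []
    for text, acoustic, lm in candidates:
        total = acoustic + alpha * lm + beta * len(text.split())
        scored.append((total, text))
    return max(scored)[1]

# A candidate the acoustic model slightly prefers can lose to one
# the language model strongly prefers.
candidates = [
    ("recognize speech", -10.0, -2.0),
    ("wreck a nice beach", -9.5, -8.0),
]
```

With acoustic scores alone, “wreck a nice beach” would win here; the LM term flips the ranking, which is why the second column is usually much better.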

Are you interested in any language in particular? I am happy to give more details. :slight_smile:


Ah, I see. Thanks for the link.

I would be interested to hear what type of impact the differences between PT-BR and PT-PT have. I’m not a speaker but I’m under the impression that one can sometimes be difficult to understand for speakers of the other. Do you know whether the training clips are primarily one or the other, or whether there is a difference in results between them?

I imagine that the training clips are split approximately according to population ratio. So 9 Brazilian clips for 1 non-Brazilian clip. That’s my impression after listening to 10 clips on the Common Voice site too.

A speaker of Brazilian Portuguese tested the system and said it is “better than expected” but that “a lot of heavy lifting is done by the language model”. No non-Brazilian speaker has tested it.

I find it difficult to understand Portuguese from Portugal, but then European Portuguese probably makes up more like 1% of my exposure to the language, not 10%.

If you are interested in variants and the relative performance of the models, then Irish and Finnish would also be quite interesting: Irish in terms of dialectal differences, and Finnish in terms of the difference between colloquial and written language.


Very interesting! Good suggestions regarding Irish and Finnish; I had no idea that there existed a difference of the sort you describe in Finnish. I will certainly investigate that.

You can check out Colloquial Finnish on Wikipedia. :slight_smile:


The Abkhazian language has about an hour of recordings; not much, but it would be interesting to have an STT model and check its quality.

I’m working on machine translation for Abkhazian; one of the techniques I’m using is back-translation, and a monolingual corpus is very useful in this case.
Would it be useful for you to have audio without text? Or text without audio? I have about 20 hours of audio, and a lot of text.
Also, do you do any sort of word tokenization? That boosts the scores for machine translation.
I’m assuming it should do the same for STT.
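For readers unfamiliar with the term: word tokenization just means splitting raw text into word and punctuation units before training or scoring. A trivial regex-based sketch; this is a generic illustration, not an Abkhaz-specific tokenizer:

```python
import re

def tokenize(text):
    # Split into runs of word characters or single punctuation marks,
    # so punctuation is scored separately from the words it touches.
    return re.findall(r"\w+|[^\w\s]", text)
```

Without this step, “word,” and “word” would count as two different vocabulary items, which hurts both MT scoring and language-model training.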


Because of the way the train–dev–test splitting works, there wasn’t enough data to be able to train the system.

fran@tepozcatl:~/cv-corpus-6.1-2020-12-11/ab$ wc -l train.tsv test.tsv dev.tsv 
  23 train.tsv
  10 test.tsv
   1 dev.tsv
  34 total

If you have more data I’d be happy to give it a shot. You can find me on the Common Voice channel on Matrix.
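For illustration, here is a naive proportional split. This is not the actual Common Voice splitting algorithm (which also keeps repeated sentences and speakers out of dev and test, making small corpora even tighter), but it shows how little data each set gets from 34 clips:

```python
def naive_split(n_clips, train_frac=0.7, dev_frac=0.1):
    # Hypothetical proportional split into train/dev/test;
    # the fractions are illustrative, not Common Voice's real rules.
    n_train = int(n_clips * train_frac)
    n_dev = int(n_clips * dev_frac)
    n_test = n_clips - n_train - n_dev
    return n_train, n_dev, n_test
```

Even under this generous scheme, 34 validated clips yield only a handful of dev and test utterances, nowhere near enough to measure (let alone train) a model.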


Okay, talk to you soon.


The server rental has now expired. The final results were:

I’ll be submitting a technical report to arXiv soon with the details. Thanks to everyone who commented.


@ftyers Francis, thank you for doing this. As I’m a complete n00b in this area, I need to ask you some questions (I think these are non-technical and will not be in your report):

  • What should the CER/WER values be for a practical application?
  • What are the reasons for higher error rates? Model or data?
  • How can we make these values better (except the obvious ones like more sentences/recordings)?