I rented a server and am training models on demand for the rest of the month. I’ve done Chuvash (cv) and Tatar (tt) so far: https://tepozcatl.omnilingo.cc/
Here is a list of the rest of the language codes of the languages I plan to train: et, ta, tr, ky, dv, mn, id, br, mt, el, th, rm-sursilv, ro, hu, ia, sl, lv, lg, or, sah, cnh, ga-IE, ja, lt, rm-vallader, ka, hsb, pa-IN, vi, fi, ab, as, hi, vot.
If you would like a trained model for a language that is in this list, find me in the Common Voice Matrix channel and let me know, and I’ll bump it up the priority list.
I think they should do it after every language with more than one hour! They take hardly any time to train (about 30 minutes on an old GPU). The stats are online here: https://tepozcatl.omnilingo.cc/manifest.html
An interesting thing about this graph is that you can see orthographic effects.
As regards WER, the systems are not really usable at the moment, but they could easily be fine-tuned for, e.g., closed-vocabulary tasks (I provide the checkpoints and the alphabet). What is striking is that for some languages the CER gets down to around 20%, which means they could potentially be useful in applications like indexing or audio search. There are probably a lot of useful things that can be done with less-than-perfect ASR…
The following models are due to be trained: rm-sursilv, ia, ja, vi, rm-vallader.
This is very neat and impressive data, nice work! Could you clarify the difference between the two CER/WER columns? I’m just getting into this sort of stuff and am not sure how to interpret them.
The first column is just the acoustic model; the second column is the acoustic model with an external KenLM-based scorer. That’s actually explained on the page, I just couldn’t fit it in the screenshot.
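In case it helps with interpreting the columns: WER and CER are just Levenshtein edit distance computed at the word and character level respectively, divided by the length of the reference. A minimal sketch in Python (my own illustration, not the code the training pipeline actually uses):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row DP)."""
    n = len(hyp)
    dp = list(range(n + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def wer(reference, hypothesis):
    """Word error rate: word-level edits / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)

def cer(reference, hypothesis):
    """Character error rate: character-level edits / reference length."""
    return edit_distance(reference, hypothesis) / len(reference)
```

So a CER of 20 on the stats page means roughly one character in five is wrong, which is why audio search can still work even when the word-level output looks bad.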
Are you interested in any language in particular? I am happy to give more details.
I would be interested to hear what type of impact the differences between PT-BR and PT-PT have. I’m not a speaker but I’m under the impression that one can sometimes be difficult to understand for speakers of the other. Do you know whether the training clips are primarily one or the other, or whether there is a difference in results between them?
I imagine that the training clips are split approximately according to population ratio, so roughly 9 Brazilian clips for every 1 non-Brazilian clip. That’s my impression after listening to 10 clips on the Common Voice site too.
A speaker of Brazilian Portuguese tested the system and said it is “better than expected” but that “a lot of heavy lifting is done by the language model”. No non-Brazilian speaker has tested it.
I find it difficult to understand Portuguese from Portugal, but then it accounts for far less than 10% of my exposure to Portuguese, probably more like 1%.
If you are interested in variants and the relative performance of the models, then Irish and Finnish would also be quite interesting: Irish in terms of dialectal differences, and Finnish in terms of the difference between the colloquial and written language.
Very interesting! Good suggestions regarding Irish and Finnish; I had no idea that a difference of the sort you describe existed in Finnish. I will certainly investigate that.
The Abkhazian language has about an hour of recordings; not much, but it would be interesting to train an STT model and check its quality.
I’m working on machine translation for Abkhazian; one of the techniques I’m using is back translation, for which a monolingual corpus is very useful.
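For anyone unfamiliar, back translation can be sketched like this: a reverse (target-to-source) model turns monolingual target-language text into synthetic parallel data. The `reverse_model` below is a hypothetical stand-in for a real trained translator, just to show the data flow:

```python
def back_translate(target_monolingual, reverse_model, forward_pairs):
    """Back translation: use a target->source model to generate
    synthetic source sentences, then add the (synthetic source,
    real target) pairs to the forward model's training data."""
    synthetic = [(reverse_model(t), t) for t in target_monolingual]
    return forward_pairs + synthetic

# Toy stand-in for a trained reverse translator (hypothetical).
reverse_model = lambda sentence: sentence[::-1]
augmented = back_translate(["аҧсуа бызшәа"], reverse_model, [])
```

The key point is that the target side of each synthetic pair is real, fluent text, which is what the forward model learns to produce.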
Would it be useful for you to have audio without text, or text without audio? I have about 20 hours of audio, and a lot of text.
Also, do you do some sort of word tokenization? That pushes the scores up for machine translation, and I’m assuming it should do the same for STT.
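By tokenization I mean something minimal like the following rough sketch (a regex-based illustration of my own, not what any particular toolkit does), so that punctuation doesn’t stick to words and the language model sees consistent units:

```python
import re

def tokenize(text):
    """Lowercase the text and split punctuation off from words,
    so 'world!' and 'world' count as the same token."""
    text = text.lower()
    # A run of word characters, or any single non-space symbol.
    return re.findall(r"\w+|[^\w\s]", text)
```

For example, `tokenize("Hello, world!")` gives `['hello', ',', 'world', '!']` instead of two punctuation-glued tokens.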
@ftyers Francis, thank you for doing this. I’m a complete noob in this area, so I need to ask you some questions (I think these are non-technical and will not be in your report):
What should the CER/WER values be for a practical application?
What are the reasons for higher error rates? Model or data?
How can we make these values better (besides the obvious ones like adding more sentences/recordings)?