Train for Portuguese that can even detect english proper nouns

rpratesh · November 18, 2019, 5:06am

Hi,

I have trained DeepSpeech to recognize Brazilian Portuguese. I have trained it with Brazilian proper nouns, music artists/songs/albums also and changed the language model accordingly. It’s working fine for now.

Now I want it to recognize the English proper nouns (and English music artists/songs/albums) .

One approach I tried is I’ve included those English words in Language model (with many variations in sentences). But still the detection is not good.
Second approach is that I’ve included English speech dataset in both train and dev, along with Portuguese datasets. This has slightly improved the English proper nouns detection, though at a huge cost of Portuguese detection.

Would like to know, if there are any more suggestions that I can try out, to achieve good enough Portuguese detection with ability to recognize the English proper nouns (with one acoustic model and one lang model).

rpratesh · November 20, 2019, 10:17am

Hi @lissyx @kdavis @reuben … any suggestions !

reuben · November 20, 2019, 10:23am

I would take a closer look at the distribution of English vs Portuguese in the training set to better match the distribution you’re aiming for (only a few English words). Also, if every sample in your training set is either completely in English or completely in Portuguese, that could be a problem, as you want the model to be able to handle code switching. So I would try to augment the training set by joining English and Portuguese samples to create some sort of artificial code switching.

rpratesh · November 20, 2019, 2:25pm

@reuben Thanks… when you mean “artificial code switching”, does it mean some decision that code (not the ASR model) takes based on input audio?..

Could you elaborate more on this

reuben · November 20, 2019, 2:29pm

rpratesh · November 20, 2019, 6:06pm

So, suppose I’ve a complete audio dataset of utterances with code-switch (like “tocar musicas de alan walker”) using a huge list of famous artists/albums/songs in Brazil (that includes both Portuguese and English names).

Do training with that dataset will be enough for DeepSpeech to handle code-switch?

Also, any suggestions on how to pick the final dataset by looking at my Portuguese, English and mixed dataset (that also includes code-switching)

reuben · November 20, 2019, 7:33pm

I can’t see the future.

I’ve already said how I would do this in my first message.

Paolo_Losi · November 21, 2019, 11:03am

@rpratesh one simple solution (as @reuben has suggested) is to change the weight of portuguese audio samples with respect to english ones. I guess that english audio samples outnumber the portuguese ones. One simple way to increase the weight of portugues is to concatenate multiple copies of the portuguese training sets so that the number of samples is greater than the english one. I am planning to do this test myself. Let me know if you get any good result from this approach.

rpratesh · November 21, 2019, 2:21pm

Yes @Paolo_Losi… that is good solution in general… but in my case, I’ve very large dataset for Portuguese compared to small and specific English dataset(artists/albums/songs).

@reuben When you meant by distribution, was it the histogram of how much data is there on either sides of English and Portuguese so as to balance them…
or is it the distribution of features (say avg. MFCCs) of each wav files to see if they form any cluster?

reuben · November 21, 2019, 2:26pm

I meant how much data of each language, yes.

Paolo_Losi · November 21, 2019, 2:31pm

@rpratesh for dataset I mean audio dataset for the audio model, not text dataset for the language model…

rpratesh · November 21, 2019, 3:18pm

Yeh… I also was referring to my audio dataset

varanda · December 18, 2020, 12:45am

Hi Pratesh, are you going to share your model for Brazilian Portuguese? I am writing an open-source Voice2Text App for desktop and initially it is going to support English. Having PT-BR as optional would be nice. Thanks.

SeaLiteral · January 18, 2021, 2:14pm

A bit late, but I think there’s an option nobody seems to have mentioned: I think the way I’d do that (not that I’ve got far enough to try it myself) would be to make a model whose output isn’t used directly. I think this is easier to explain with some examples, but, since I don’t speak Portuguese, I’ll use examples with Spanish instead (you should still be able to tell what they’re doing).

First, I’d make training data, both for the acoustic model and for the scorer model, that’s in Portuguese and where some sentences contain the names I want it to be able to recognize. At this point I’d have sentences like these (in Spanish because I don’t speak Portuguese):
Escucha música de Amy Winehouse.
un concierto de U2
and when preprocessing it, rather than just removing punctuation (unless you pronounce the pronunciation marks) and uncapitalizing everything, I’d also replace everything with a spelling of the pronunciation. So in the first sentence I’d replace “Amy Winehouse” with “eimi wainjaus” because that’s the closest I can do with Spanish spelling, and in the second sentence, I’ll write “u dos” since speakers of Spanish do seem to pronounce the 2 in Spanish in the name of the band U2.

Next, I’d train both the acoustic model and the language/scorer model. If I don’t have a bunch of audio of people talking about foreign musicians, I’d make a more general model first and maybe fine tune it a little bit. After this, if I try to get it to transcribe speech with those models, it should hopefully be able to write those names, although it will use Spanish spelling of some English pronunciations.

Finally, I’d run the output through some other tool that has a list of the musician/band names in their English and mangled spellings, and replaces the mangled spellings with the correct ones. Such a tool shouldn’t be very difficult to make.

I think I’ve read that some commercial speech recognition tools let you add words to their vocabularies by writing how they’re spelled and writing how they sound like they should be pronounced. So they might be doing some variant of this. I’ve never used those tools myself though.

Topic		Replies	Views
How hard is it to train a new model? DeepSpeech	3	560	February 10, 2020
Decoding for proper nouns(human names) DeepSpeech	1	667	May 15, 2018
Portuguese TTS model TTS (Text-to-Speech)	6	3705	May 15, 2020
A question regarding Portuguese dataset Common Voice	2	738	August 16, 2022
Custom Model returns wrong inferences using transfer learning DeepSpeech	4	667	February 21, 2022

Train for Portuguese that can even detect english proper nouns

Related topics