Training a Portuguese model that can also detect English proper nouns


I have trained DeepSpeech to recognize Brazilian Portuguese, including Brazilian proper nouns and music artists/songs/albums, and changed the language model accordingly. It’s working fine for now.

Now I want it to recognize English proper nouns (and English music artists/songs/albums) as well.

  • One approach I tried is including those English words in the language model (with many variations in sentences), but the detection is still not good.
  • The second approach was to include an English speech dataset in both train and dev, along with the Portuguese datasets. This slightly improved English proper-noun detection, though at a huge cost to Portuguese accuracy.

I would like to know if there are any more suggestions I can try to achieve good Portuguese recognition along with the ability to recognize English proper nouns (with one acoustic model and one language model).

Hi @lissyx @kdavis @reuben … any suggestions?

I would take a closer look at the distribution of English vs Portuguese in the training set to better match the distribution you’re aiming for (only a few English words). Also, if every sample in your training set is either completely in English or completely in Portuguese, that could be a problem, as you want the model to be able to handle code switching. So I would try to augment the training set by joining English and Portuguese samples to create some sort of artificial code switching.
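To make that augmentation concrete, here is a minimal sketch (not from this thread) of splicing an English clip onto a Portuguese clip to fabricate a code-switched training sample. It assumes all clips are already in the format DeepSpeech expects (16 kHz, 16-bit mono WAV); the function and file names are hypothetical.

```python
import wave

def make_code_switch_sample(pt_wav, en_wav, out_wav):
    """Concatenate a Portuguese clip and an English clip into one WAV.

    Both inputs must share channels/sample width/rate; the transcript of
    the joined clip is simply the two transcripts joined with a space.
    """
    with wave.open(pt_wav, "rb") as a, wave.open(en_wav, "rb") as b:
        assert a.getparams()[:3] == b.getparams()[:3], "audio formats must match"
        frames = a.readframes(a.getnframes()) + b.readframes(b.getnframes())
        params = a.getparams()
    with wave.open(out_wav, "wb") as out:
        out.setparams(params)  # header is fixed up on close to the real length
        out.writeframes(frames)
```

A short silence could also be inserted between the two clips so the join sounds less abrupt, but plain concatenation is the simplest starting point.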

@reuben Thanks… when you say “artificial code switching”, do you mean some decision that code (not the ASR model) makes based on the input audio?

Could you elaborate more on this?

So, suppose I’ve a complete audio dataset of utterances with code-switch (like “tocar musicas de alan walker”) using a huge list of famous artists/albums/songs in Brazil (that includes both Portuguese and English names).

Will training with that dataset be enough for DeepSpeech to handle code-switching?

Also, any suggestions on how to pick the final dataset by looking at my Portuguese, English, and mixed (code-switching) datasets?

I can’t see the future.

I’ve already said how I would do this in my first message.

@rpratesh one simple solution (as @reuben has suggested) is to change the weight of the Portuguese audio samples relative to the English ones. I guess that the English audio samples outnumber the Portuguese ones. One simple way to increase the weight of Portuguese is to concatenate multiple copies of the Portuguese training set so that its number of samples is greater than the English one. I am planning to run this test myself. Let me know if you get any good results from this approach.
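A minimal sketch of that duplication trick, assuming the standard DeepSpeech training CSV layout (`wav_filename,wav_filesize,transcript`); the function name and file paths are hypothetical:

```python
import csv

def oversample_csv(in_csv, out_csv, copies):
    """Write `copies` copies of every row of a DeepSpeech training CSV
    (header written once), so the under-represented language is seen
    proportionally more often per epoch."""
    with open(in_csv, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for _ in range(copies):
            writer.writerows(rows)
```

The resulting CSV can then be passed as one of the train files alongside the majority-language CSV.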

Yes @Paolo_Losi… that is a good solution in general, but in my case I have a very large dataset for Portuguese compared to a small, specific English dataset (artists/albums/songs).

@reuben When you mentioned distribution, did you mean the histogram of how much data there is for each of English and Portuguese, so as to balance them,
or the distribution of features (say, average MFCCs) of each wav file, to see if they form clusters?

I meant how much data of each language, yes.
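One quick way to measure that per-language balance, sketched here under the assumption of 16 kHz, 16-bit mono audio and the standard DeepSpeech CSV columns, is to estimate hours of speech directly from `wav_filesize`:

```python
import csv

BYTES_PER_SECOND = 16000 * 2  # 16 kHz, 16-bit mono, as DeepSpeech expects

def hours_of_audio(csv_path):
    """Rough hours of speech in one training CSV, from wav_filesize alone
    (ignores the small per-file WAV header overhead)."""
    with open(csv_path, newline="") as f:
        total_bytes = sum(int(row["wav_filesize"]) for row in csv.DictReader(f))
    return total_bytes / BYTES_PER_SECOND / 3600
```

Comparing `hours_of_audio` across the Portuguese and English CSVs gives the language ratio actually seen in training.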

@rpratesh by dataset I mean the audio dataset for the acoustic model, not the text dataset for the language model…

Yeah… I was also referring to my audio dataset.

Hi Pratesh, are you going to share your model for Brazilian Portuguese? I am writing an open-source Voice2Text App for desktop and initially it is going to support English. Having PT-BR as optional would be nice. Thanks.

A bit late, but I think there’s an option nobody seems to have mentioned: I think the way I’d do that (not that I’ve got far enough to try it myself) would be to make a model whose output isn’t used directly. I think this is easier to explain with some examples, but, since I don’t speak Portuguese, I’ll use examples with Spanish instead (you should still be able to tell what they’re doing).

First, I’d make training data, both for the acoustic model and for the scorer model, that’s in Portuguese and where some sentences contain the names I want it to be able to recognize. At this point I’d have sentences like these (in Spanish because I don’t speak Portuguese):
Escucha música de Amy Winehouse.
un concierto de U2
and when preprocessing it, rather than just removing punctuation (unless the punctuation marks are actually pronounced) and lowercasing everything, I’d also replace each foreign name with a spelling of its pronunciation. So in the first sentence I’d replace “Amy Winehouse” with “eimi wainjaus”, because that’s the closest I can get with Spanish spelling, and in the second sentence I’d write “u dos”, since Spanish speakers do seem to pronounce the 2 in U2 in Spanish.
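As a rough sketch of that preprocessing step (the respelling table is hypothetical and just hardcodes the two examples above):

```python
# Hypothetical respelling table: each foreign name mapped to the way a
# Spanish speaker would spell its pronunciation.
RESPELLINGS = {
    "amy winehouse": "eimi wainjaus",
    "u2": "u dos",
}

def respell(sentence):
    """Lowercase a transcript and swap each foreign name for a
    native-alphabet spelling of its pronunciation."""
    s = sentence.lower()
    for name, spelled in RESPELLINGS.items():
        s = s.replace(name, spelled)
    return s
```

In a real pipeline the table would be generated from the full artist/album/song list rather than written by hand.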

Next, I’d train both the acoustic model and the language/scorer model. If I don’t have a bunch of audio of people talking about foreign musicians, I’d make a more general model first and maybe fine tune it a little bit. After this, if I try to get it to transcribe speech with those models, it should hopefully be able to write those names, although it will use Spanish spelling of some English pronunciations.

Finally, I’d run the output through some other tool that has a list of the musician/band names in their English and mangled spellings, and replaces the mangled spellings with the correct ones. Such a tool shouldn’t be very difficult to make.

I think I’ve read that some commercial speech recognition tools let you add words to their vocabularies by writing how they’re spelled and writing how they sound like they should be pronounced. So they might be doing some variant of this. I’ve never used those tools myself though.