One model per accent or one model for all accents?

Hi,

I would like to know: what is the best approach to improve recognition across the different types of English accents spoken in various parts of the world?

1) Build a single model with all the different accents combined: American, British, Indian, …

2) Build multiple models: one for American English, one for Indian English, one for British English, … Then somehow detect the speaker's accent and route to the right model.

3) Or something else?

Thanks.

The first approach is almost certainly better. Ideally, the model should learn to be accent-invariant: the latent representation then becomes far more robust and generalizes better to new accents.

The second approach fundamentally suffers from the fact that each model gets access to far less data (since we now split our data by accent), and for some rare accents we will be so data-deprived that training a deep model is not feasible. We also become dependent on the correctness of the initial "accent classifier", for which there may not be much data if it is trained independently (I am not sure accent identification is an established task in the speech domain with large datasets). That classifier will also be less accurate for rare accents, which compounds the problem, and it breaks down when we encounter new accents or accents near a category boundary: accents are probably better understood as a gradient than as hard categorical bins such as "British" and "American".
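To make the single-model option concrete, here is a minimal sketch of pooling per-accent data into one training corpus with accent-balanced sampling, so rare accents are not drowned out by high-resource ones. The manifest format, file paths, and accent labels are hypothetical placeholders, not any particular toolkit's API:

```python
import random
from collections import defaultdict

# Hypothetical manifest entries: (audio_path, transcript, accent).
# The paths and accent labels below are illustrative placeholders.
manifest = [
    ("clips/us_0001.wav", "turn on the lights", "american"),
    ("clips/uk_0001.wav", "turn on the lights", "british"),
    ("clips/in_0001.wav", "turn on the lights", "indian"),
    # ... many more entries, typically imbalanced across accents
]

def balanced_batches(manifest, batch_size, seed=0):
    """Yield batches that sample accents uniformly, so a single shared
    model sees low-resource accents as often as high-resource ones."""
    rng = random.Random(seed)
    by_accent = defaultdict(list)
    for path, text, accent in manifest:
        by_accent[accent].append((path, text))
    accents = sorted(by_accent)
    while True:
        yield [rng.choice(by_accent[rng.choice(accents)])
               for _ in range(batch_size)]

# Every batch feeds the same acoustic model, so no accent classifier
# or per-accent routing is needed at inference time.
print(next(balanced_batches(manifest, batch_size=2)))
```

The point of the uniform-over-accents draw is exactly the data argument above: the shared model still benefits from all the pooled data, while rare accents are not starved during training.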


That’s my thinking too. Trying to accommodate things like speaking while eating, loose dentures, peculiar accents unique to an individual, etc. would be a monumental task.

IMHO, best-fit adjustments made in the context of surrounding words would work better than more granular “hard” decisions.
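To illustrate the "best fit in context" idea, here is a toy sketch that rescores per-word candidates with a smoothed bigram language model instead of committing to each word's top acoustic guess. All words, scores, and counts are made-up illustrative values, not output from any real recognizer:

```python
import math
from itertools import product

# Hypothetical per-word candidates with acoustic log-scores; a "hard"
# decision would simply keep the top candidate at each position.
candidates = [
    [("wreck", -0.3), ("recognize", -0.9)],
    [("a", -0.2), ("speech", -1.0)],
    [("nice", -0.4), ("beach", -0.8)],
]

# Made-up bigram counts standing in for a trained language model.
bigram_counts = {("recognize", "speech"): 50, ("speech", "beach"): 40,
                 ("wreck", "a"): 5, ("a", "nice"): 30}
total = sum(bigram_counts.values())

def lm_logprob(prev, word, alpha=1.0, vocab=1000):
    # Add-one smoothed bigram log-probability (toy estimate).
    return math.log((bigram_counts.get((prev, word), 0) + alpha)
                    / (total + alpha * vocab))

def best_sentence(candidates, lm_weight=2.0):
    """Score every word combination by acoustic score plus weighted
    language-model score, keeping the best fit in context."""
    best, best_score = None, float("-inf")
    for combo in product(*candidates):
        words = [w for w, _ in combo]
        score = sum(s for _, s in combo)
        score += lm_weight * sum(lm_logprob(a, b)
                                 for a, b in zip(words, words[1:]))
        if score > best_score:
            best, best_score = words, score
    return best

# Context flips the per-word top picks ("wreck a nice"):
print(best_sentence(candidates))  # ['recognize', 'speech', 'beach']
```

The same soft, context-weighted scoring handles idiosyncratic speech (eating, dentures, personal accents) more gracefully than a hard per-word or per-accent decision ever could.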
