Segmentation of Portuguese into different accents

Hi.

When I filled in my profile I indicated that I speak two languages: English and Portuguese.

I noticed that you gave options for accents in English but none for Portuguese.

I believe this is not entirely helpful, since I suspect the overwhelming majority of speakers you are recording in Portuguese are Brazilian.

I am Portuguese, from Portugal, and therefore my Portuguese clips are considerably different in terms of accent. Given the prevalence of Brazilian Portuguese contributions, I believe it would be advisable to segment people like me into a separate accent.

Best regards

Duarte Molha

Hi there! This post discusses some of the issues, e.g. for languages which don’t have accents defined yet it might be worth waiting for the new accent and variant strategy to come online before defining them. But as @phirework says:

“… if you can demonstrate consensus from linguists/other experts for a list of regions in Italy/the diaspora that align with the Language and Accent strategy, I’d be happy to take a look at a pull request (see Galician as a recent example).”

I would add that this is for training speech recognition, where a single model can take advantage of speakers with a wide variety of accents, even if the accents are very different or the speakers are non-native. For example, I imagine your accent in English is quite different from mine, but it still helps to have your accent in the English model. So it will help to have your accent in the Portuguese model, and if the accents are so different, it will be easy for a machine to distinguish them. And if they aren’t that different, then it doesn’t really matter. :slight_smile:

Portuguese speakers from Portugal can very easily understand Brazilian speakers (apart from some specific words or slang native to Brazil).

But I know from experience that Brazilians have great difficulty understanding Portuguese from Portugal. The main reason is that speakers from Portugal use more closed vowel sounds, which are very difficult for Brazilians, who are accustomed to open vowel sounds.

I know, but that is not a problem for the speech recognition system. :slight_smile: I don’t speak Portuguese very well, but for me too it is easier to understand a person from Brazil than one from Portugal. :smiley:

Your reply confirms the need to separate the two.

On the contrary, the problem is that I don’t have much experience talking with people from Portugal. If we want the system to have that experience, we have to give it examples of all the accents. I think that if I had that experience, I would be able to understand both accents too! :slight_smile:

Sorry … I will reply in English. Even though I am Portuguese… I am better at describing things in English.

I believe you are seeing this from an incorrect perspective. The system relies on human curators… I believe that because we are not segmenting Portuguese from Portugal into a separate accent, Brazilian Portuguese speakers are given Portuguese (from Portugal) clips to curate.

I have little doubt that many of the clips I have submitted will be rejected by Brazilian curators, as they will not understand the accent and will consider it incorrect.

I am not saying that the system does not benefit from different accents … but if it does not matter, why are many English accents listed in the system?

That is a very easy experiment to conduct. Download the Portuguese data from the 6.1 release and look at a sample of clips with down votes. Perhaps you are right, but I would guess you are not.

If I contributed to the English Common Voice (I don’t, there are quite enough white English guys in there already!) then I would skip those contributions in accents I did not understand. I would only down vote those in which the user said something I understood but believed to be wrong.

As for why English accents are listed: that is because languages like English are following an old system of accent/dialect encoding.

In any case, as I pointed out in my first reply the system of accents/dialects etc. is in flux at the moment. I don’t think it is necessarily bad to collect that information. I just don’t think it is necessary if you are interested in using the data for speech recognition.

PS. I tried contributing to the Portuguese Common Voice. When I did some listening, I think about 1 in 10 voices were from Portugal, although I wasn’t sure about one of them, so maybe 2 in 10. I’d love to hear more!

The latest stats say that about 65% of the Brazilian population has internet access, which is about 137M people.

In Portugal it is about 75%, so about 9.5 million.

So that is about 7% of the total potential Brazilian users.

Assuming there are contributors from other Portuguese-speaking countries too, the ratio is probably around 90% Brazilian and 10% others :slight_smile:
For example, Angola is the second-largest Portuguese-speaking country and has almost 3 times the population of Portugal, but only 14% of its people have internet access (around 4.5M).
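For anyone who wants to redo that arithmetic, here is a minimal sketch. It uses only the rough internet-user figures quoted in this post (they are not independently verified), and the names `internet_users_millions` and `share_of_total` are made up for the example.

```python
# Internet-user estimates quoted in this thread, in millions.
# These are the poster's rough figures, not verified statistics.
internet_users_millions = {
    "Brazil": 137.0,
    "Portugal": 9.5,
    "Angola": 4.5,
}


def share_of_total(users):
    """Return each country's share of the combined total, as a fraction."""
    total = sum(users.values())
    return {country: count / total for country, count in users.items()}


shares = share_of_total(internet_users_millions)

# Portugal's potential contributors relative to Brazil's (the "about 7%" above).
portugal_vs_brazil = (
    internet_users_millions["Portugal"] / internet_users_millions["Brazil"]
)

for country, share in shares.items():
    print(f"{country}: {share:.1%} of the three-country total")
print(f"Portugal relative to Brazil: {portugal_vs_brazil:.1%}")
```

With only these three countries in the table, Brazil comes out at roughly nine tenths of the total, which is where the 90/10 guess comes from. Adding other Portuguese-speaking countries would shift the numbers a little.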

I personally have reviewed quite a few Portuguese clips and I have yet to find any Portuguese (from Portugal) speakers… maybe the ones you got are mine! :slight_smile:

BTW… the downloadable set does not include invalid samples (at least that is what they say on the download page)

I am downloading now so I will check :slight_smile:

I was right … the downloaded dataset does not have the invalid clips, so there is no way to check the reason why the clips are invalid.

Many of the invalid clips were from a few individuals. One had 344 clips invalidated (almost 20% of all invalid clips).

I wonder what happened there? It could be that the microphone is of bad quality and the clips are getting rejected. But it would be interesting to see if these rejected clips are enriched for Portuguese (from Portugal) speakers… unfortunately … those clips are not included in the download to analyse.

That’s interesting. Another thing you could do is look at the clips with a single downvote and see if Portuguese speakers from Portugal are more represented there.

That is more than 3300 clips :slight_smile:

This is not my full-time job heheheh

A sample of 50-100 should probably be sufficient to get an approximate idea.
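Drawing that sample does not have to be done by hand. Here is a rough sketch, assuming the release ships a tab-separated file with `path` and `down_votes` columns (I believe that is the layout, but check the header of your download). The function name `sample_downvoted` is invented for this example.

```python
import csv
import io
import random


def sample_downvoted(tsv_file, n=50, seed=0):
    """Return up to n randomly chosen clip paths that have exactly one down vote."""
    # QUOTE_NONE so that quotation marks inside sentences are not treated as CSV quoting.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    candidates = [row["path"] for row in reader if int(row["down_votes"]) == 1]
    rng = random.Random(seed)  # fixed seed so the same sample can be drawn again
    return rng.sample(candidates, min(n, len(candidates)))


# Tiny made-up table standing in for the real file, just to show the call.
demo = io.StringIO(
    "path\tup_votes\tdown_votes\n"
    "clip_a.mp3\t2\t0\n"
    "clip_b.mp3\t2\t1\n"
    "clip_c.mp3\t3\t1\n"
)
picked = sample_downvoted(demo, n=50)
print(picked)
```

With the real data you would pass an open file instead of `demo`, for example `open("validated.tsv", newline="", encoding="utf-8")` (the file name is an assumption), then listen to the returned clips and note which ones are from Portugal.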

I am not sure it would be robust enough to draw any firm conclusions. Firstly, these are all samples that have been considered valid (even if they have downvotes).
So … even if they are enriched for native Portuguese clips, I could not say it was having a negative effect, as they (apart from the additional downvotes) were still passing curation.
And also, assuming a 90% to 10% ratio of Brazilian to Portuguese speakers, analysing 100 samples would turn up only about 10 Portuguese ones on average. I do not think I can draw any conclusions from such low numbers.
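To put a number on how noisy that would be, here is a small sketch that treats each sampled clip as an independent draw with a 10% chance of being from Portugal. Both the 10% figure (the rough guess from this thread) and the independence are assumptions. Note that 344 clips from one speaker, as seen in the invalid list, would break independence.

```python
import math

n = 100   # clips listened to
p = 0.10  # assumed share of speakers from Portugal (rough guess, not measured)

# Mean and standard deviation of a binomial count.
expected = n * p
sd = math.sqrt(n * p * (1 - p))


def binom_pmf(k, n, p):
    """Probability of exactly k hits in n independent draws with hit rate p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


# Chance of hearing 5 or fewer Portugal clips even though the true share is 10%.
prob_at_most_5 = sum(binom_pmf(k, n, p) for k in range(6))

print(f"expected {expected:.0f} clips, standard deviation {sd:.1f}")
print(f"P(5 or fewer) = {prob_at_most_5:.3f}")
```

The standard deviation is about 3 clips around an expected 10, so a sample of 100 could easily show anything from roughly 4 to 16 Portugal clips by chance alone.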

But anyway these are valid clips so I do not think they are the ones I should consider.

For me the only way to really draw any robust conclusions would be to see if the population of native Portuguese samples in the rejected clips has the same average and SD as the population of native Portuguese (from Portugal) speakers in the validated samples.

Only that would give me a good signal that Portuguese from Portugal was being rejected more than it should be.
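One way such a comparison could be run, if the rejected clips were ever available, is sketched below. It swaps the "average and SD" idea for a plain two-proportion z-test, which is my assumption of a reasonable stand-in and not something anyone in this thread has run. The counts in the example call are invented purely to show usage.

```python
import math


def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Two-sided z-test for the difference between two proportions.

    hits_a / n_a: e.g. Portugal clips among the rejected clips listened to.
    hits_b / n_b: e.g. Portugal clips among the validated clips listened to.
    Returns (z, p_value) using the normal approximation.
    """
    p_a = hits_a / n_a
    p_b = hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return z, p_value


# Hypothetical counts, NOT real Common Voice numbers.
z, p_value = two_proportion_z(hits_a=30, n_a=100, hits_b=10, n_b=100)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value would suggest that Portugal speakers show up among rejected clips more often than among validated ones. The normal approximation is shaky when the counts are as low as 10, so an exact test would be a safer choice for small samples.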

Yes, that would definitely be statistically more satisfying!