I downloaded the most recent release of the Common Voice corpus (6.1) for Portuguese. Although it listed the number of invalid clips and the hash of the user, it did not include those clips in the download.

I was trying to investigate whether the failed clips are enriched for particular accents of Portuguese (given the prevalence of Brazilian Portuguese users in the dataset)…

But since those clips are not included in the download there is no way of determining this.

Do you know if there is a way to download those failed clips?

Is it possible the clips are invalid because there was a problem at recording time and there is actually no audio file available at all? I kind of remember at some point there was a regression with this behavior.

Of course, that is probably the case for the majority of the invalid recordings… I am just trying to see if there is an effect of some clips or users being rejected solely because of their accent.

This goes back to another question I posed in this forum: why is there no further breakdown of Portuguese into regional accents, as there is for English?

My hypothesis is that, because of the prevalence of Brazilian users compared to users from Portugal, and because European Portuguese is very difficult for many Brazilians to understand, it is entirely possible that clips are being rejected simply because of this. I wanted to see if this was in fact occurring…

But without access to the audio of the invalid clips, I have no way of knowing whether the Portuguese accent is in fact the reason the clips are flagged as invalid.
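
The metadata alone might still give a rough first signal, even without the audio. Below is a minimal sketch, assuming the download contains `validated.tsv` and `invalidated.tsv` as tab-separated files with an `accent` column (the file names and column name are my assumptions about the layout, and the helper names are made up for illustration). It only compares how often each accent label shows up among invalidated clips; it cannot tell why a clip was rejected, and it is useless for rows where the accent was never filled in.

```python
import csv
from collections import Counter


def accent_counts(tsv_path):
    """Count rows per accent label in one tab-separated metadata file."""
    counts = Counter()
    with open(tsv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Rows with an empty or missing accent are grouped together.
            label = (row.get("accent") or "").strip() or "(unlabeled)"
            counts[label] += 1
    return counts


def invalid_share_by_accent(validated_tsv, invalidated_tsv):
    """Return {accent: invalidated / (validated + invalidated)}."""
    valid = accent_counts(validated_tsv)
    invalid = accent_counts(invalidated_tsv)
    shares = {}
    for accent in set(valid) | set(invalid):
        total = valid[accent] + invalid[accent]  # always > 0 here
        shares[accent] = invalid[accent] / total
    return shares
```

If one accent label had a clearly higher share than the others, that would be a hint worth following up by actually listening to the clips, not a conclusion in itself.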

For more information on this discussion, see this other thread:

The clips for invalid files are available. I checked yesterday.
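
If you want to verify this on your own copy, here is a minimal sketch. It assumes the extracted archive has an `invalidated.tsv` with a `path` column and a `clips` folder next to it (those names are my assumption about the layout, and `missing_invalidated_clips` is just an illustrative helper). It reports how many invalidated entries are listed and which of them have no audio file on disk.

```python
import csv
from pathlib import Path


def missing_invalidated_clips(dataset_dir):
    """Return (number of listed invalidated clips, list of paths with no file)."""
    dataset_dir = Path(dataset_dir)
    clips_dir = dataset_dir / "clips"  # assumed folder name
    total = 0
    missing = []
    tsv_path = dataset_dir / "invalidated.tsv"  # assumed file name
    with open(tsv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            total += 1
            # A clip counts as missing if no regular file exists at its path.
            if not (clips_dir / row["path"]).is_file():
                missing.append(row["path"])
    return total, missing
```

Calling it as `total, missing = missing_invalidated_clips("path/to/extracted/pt")` (a placeholder path) and finding `missing` empty would mean every invalidated clip has its audio in the download.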