How to "un-bias" a language?

bozden · March 3, 2021, 12:55pm

OK, I recently started with ML and came here to “donate” to one language set. What I recognized after a couple of hundred record/listen sessions:

There are many sentences with foreign language words embedded (mostly names and mostly very hard to pronounce). I don’t know why they are in the dataset.
I could only hear a single female voice, one in 50 or so.
Speakers with regional dialects dominate (dialects are fine and good unless they dominate).
I see that users are racing to record more, which would make their voices/dialects even more dominant.
Most recordings lack correct intonation (e.g. in questions); they sound like a machine speaking. This is also fine in itself, but such recordings dominate.

I don’t know about other languages, but I would like to hear about your experience with them.

So, I think the points given above will be problematic. As a newbie in this area I would like to hear your ideas on these items and their remedies.

Thanks in advance

irvin · March 4, 2021, 8:45am

@bozden which locale were you referring to?

bozden · March 4, 2021, 9:06am

Thank you for answering, Irvin. Actually I wrote it initially, but removed it later out of courtesy to fellow donors. It is Turkish. I can see that the group is relatively new, so these problems may be remedied in time as new people join.

In the meantime I went on with recordings. I’m in my 50s but from a metropolitan area; some of our citizens may call me a “white Turk”. My (and others’) input may balance some of these, but I also have to stop at some point. Now I’m trying to recruit my wife.

Joke and locale aside, this is a major problem. What is the elegant and scientifically correct way of dealing with this in this environment? Are there any guidelines from your side?

One solution I can think of is forming locale based sub-forums so we can talk about these…

bozden · March 5, 2021, 3:34pm

More follow-up:
After more usage, the dialects got better on average. But some of the source texts are unacceptable.

Nobody uses the name of a Serbian middle distance runner in everyday life. I can’t even imagine how I can pronounce this…

And nobody uses the Common Voice Sentence Collector. I wonder how these sentences ended up here?

ftyers · March 5, 2021, 3:37pm

A response to two of your points:

The sentences have foreign words because many of them are from SETimes, a public domain corpus of Turkish sentences. There are very few public domain sources of Turkish sentences, and we cannot use copyrighted sentences. So this was a way of bootstrapping the system. We would always welcome the addition of more and better sentences through the sentence collector.
The solution is to encourage more recordings from Beyaz Türkler. If people with regional variations predominate, it is because they are the ones contributing.

The issue with having people speak more slowly is an interesting one. I have been working with the Turkish data recently and have not found the same kind of issues that you have in this case. If you are interested in pulling out the sentences where people speak slower, you could try calculating the characters per second (transcript length / audio length) and sorting them by that.
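
A minimal sketch of the characters-per-second idea above, in Python. The `Clip` class, its field names and the sample clips are made up for illustration; it assumes you have already measured each clip's duration in seconds with an audio tool of your choice.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    # Hypothetical record: the field names are illustrative only.
    path: str
    sentence: str      # transcript of the clip
    duration_s: float  # audio length in seconds, measured separately


def chars_per_second(clip: Clip) -> float:
    """Transcript length divided by audio length."""
    return len(clip.sentence) / clip.duration_s


def sort_by_speaking_rate(clips):
    """Return clips ordered from slowest to fastest speech."""
    usable = [c for c in clips if c.duration_s > 0]  # skip empty audio
    return sorted(usable, key=chars_per_second)


# Made-up sample data, not taken from the real dataset.
clips = [
    Clip("a.mp3", "Merhaba dünya", 2.0),
    Clip("b.mp3", "Bugün hava çok güzel", 8.0),
    Clip("c.mp3", "Evet", 0.5),
]

slowest_first = sort_by_speaking_rate(clips)
for clip in slowest_first:
    print(f"{clip.path}: {chars_per_second(clip):.1f} chars/s")
```

Clips at the top of the sorted list are the candidates for slow or machine-like speech. Leading or trailing silence also lowers the rate, so the list is a starting point for listening, not a verdict.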

About the correct question intonation, I do not know how to deal with that.

ftyers · March 5, 2021, 3:38pm

You can pronounce it like “Snejana Paykiç” I think.

bozden · March 5, 2021, 4:27pm

Thank you for the explanation, Francis, now I understand. Indeed there is a problem with finding CC0 sources, as this license is not used in Turkey, except by some (like me). I spent a couple of hours searching for such sources to contribute to the sentence collector, and I did add some. If you have time, may I ask a few questions?

I don’t know the process behind sentence collector. If I enter new sentences, when will they come online?
As suggested I will collect everyday wording from National Assembly’s meeting minutes, very tough work.
There are some writers whose works become public domain (75+ years passed), can these be valid sources?

As I said, the data I hear is becoming acceptable in terms of bias, and I am encouraging my contacts to contribute.
One big problem is the female/male ratio. I downloaded and analyzed the voice data in Excel. Here are some results:

As it is, I don’t think it is ready for any use…
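
For anyone who wants to repeat this kind of check without Excel, here is a rough Python sketch. It assumes the tab-separated metadata has `client_id` and `gender` columns, as the Common Voice release files I have seen do; the sample rows below are invented and are not results from the Turkish dataset.

```python
import csv
import io
from collections import Counter

# Invented sample in the assumed layout; replace with open("validated.tsv", ...)
SAMPLE_TSV = (
    "client_id\tpath\tgender\n"
    "spk1\tclip1.mp3\tmale\n"
    "spk1\tclip2.mp3\tmale\n"
    "spk1\tclip3.mp3\tmale\n"
    "spk2\tclip4.mp3\tfemale\n"
    "spk3\tclip5.mp3\t\n"
)


def gender_shares(tsv_file):
    """Count clips and distinct speakers per self-reported gender."""
    clip_counts = Counter()
    speaker_gender = {}
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        gender = row["gender"] or "unspecified"  # empty field = not reported
        clip_counts[gender] += 1
        # Remember the first gender seen for each speaker.
        speaker_gender.setdefault(row["client_id"], gender)
    speaker_counts = Counter(speaker_gender.values())
    return clip_counts, speaker_counts


clip_counts, speaker_counts = gender_shares(io.StringIO(SAMPLE_TSV))
print("clips:", dict(clip_counts))
print("speakers:", dict(speaker_counts))
```

Counting both clips and distinct speakers matters here: a few very active contributors can skew the clip ratio far more than the speaker ratio.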

What is the schedule for publishing new downloadable data?

Another problem is with the merge of “single word commands”, indicated as “Benchmark” in the dataset. I did not see/hear any of these after hundreds of sentences. Current data is very minimal. I suggested them back (with additions from Pete Warden’s dataset) into the “sentence” collector, since I’m working on TinyML as my first use case. I don’t know if I did the right thing though…

Thank you for your time and support…

bozden · March 5, 2021, 5:55pm

After reading it a third time, I recognized what you meant by this. With the phrase “people are racing”, I meant “a handful of people are racing for the ranking in the top contributors list”, the one on the bottom right of the dashboard.

Now I’m also among them. People must understand the importance of diversity here… With the current implementation, there is no tooling for diversity. You may want to encourage this openly and de-emphasize the “top 10 list”. “Recruit a woman / your elders as volunteers and you may record 100 more” might work, for example.

bozden · March 5, 2021, 6:07pm

Another issue I found is with the validity of the validated data. In the dataset I saw that some recordings got 2 upvotes and 1 downvote; I found some of them and listened. Some of them should not actually have been validated. Maybe the algorithm could be changed a bit, e.g. validate/reject only if the difference is at least 2, and re-validate the problematic ones.
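
To make the proposal concrete, here is a small Python sketch of a “margin of 2” rule. It is only an illustration of the idea, not how the Common Voice platform decides; the function name and the three labels are made up.

```python
def classify(up_votes: int, down_votes: int, margin: int = 2) -> str:
    """Accept or reject a clip only when the vote difference reaches `margin`.

    Anything closer than that is sent back for more listening.
    """
    diff = up_votes - down_votes
    if diff >= margin:
        return "validated"
    if -diff >= margin:
        return "rejected"
    return "needs_review"


# The contested case from the post: 2 upvotes, 1 downvote.
print(classify(2, 1))  # difference is 1, below the margin
print(classify(2, 0))
print(classify(0, 2))
```

Since the downloadable data includes the vote counts for every clip, a rule like this can be applied on the user's side when selecting training data.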

Currently, for the dataset to be useful, one would have to listen to all the recordings and handpick them before feeding them to the model. Well, that would defeat the purpose of course…

ftyers · March 6, 2021, 11:45pm

Dear Bülent,

The sentences in sentence collector are added when two people have validated them (pretty much like the clips). I realise this is a cumbersome process, but it is what we have for now.
These books are fine, although they would have to be by an author who died before 1951, and some of the language might be difficult for modern readers. In principle we could use books by Sabahattin Ali: he was murdered in 1948, so these should be in the public domain.

The schedule for new downloadable data is uncertain. The Common Voice team is on a break until the 15th March at least, but I expect a new version will be out at some point this year.

The data provided includes all clips and includes the up/down votes, so people can take their own approach to validating if they like.

I am on the Common Voice Matrix channel if you want to discuss in real time. This discourse system is not ideal for conversation-like interactions.
Best regards,
Francis M. Tyers

bozden · March 7, 2021, 5:39am

Thank you Francis, most of my questions are answered. The other problems will have to be solved over time with contributions to the data.

I was planning to start with Kürk Mantolu Madonna by Sabahattin Ali

And thanks to everyone who contributed/contributes to this wonderful project!
Bülent Özden

bozden · March 7, 2021, 6:49pm

Follow-up:

Extracted/selected/fixed/added ~2500 sentences from Kürk Mantolu Madonna to sentence collector.

My old classmates also joined the cause, starting with verification.
