How to "un-bias" a language?

OK, I recently started with ML and came here to “donate” to one language set. What I recognized after a couple of hundred record/listen sessions:

  • There are many sentences with foreign language words embedded (mostly names and mostly very hard to pronounce). I don’t know why they are in the dataset.
  • I could only hear a single female voice, one in 50 or so.
  • People with dialects dominate (dialects are fine & good unless they dominate).
  • I see that users are racing to do more, but their voices/dialects would become dominant.
  • Most recordings do not have the correct intonation (e.g. in questions); they sound like a machine speaking. This is also fine, but they dominate.

I don’t know about other languages, but I’d like to hear about your experience with them.

So, I think the points given above will be problematic. As a newbie in this area, I would like to hear your ideas on these items and their remedies.

Thanks in advance

@bozden which locale were you referring to?

Thank you for answering, Irvin. Actually I wrote it initially, but removed it later out of courtesy to fellow donors. It is Turkish. I can see that the group is relatively new, so these problems may be remedied in time with new folks.

In the meantime I went on with recordings. I’m in my 50s and from a metropolitan area; some of our citizens might call me a “white Turk”. My (and others’) input may balance some of these, but I also have to stop at some point. Now I’m trying to recruit my wife :grinning:

Joke and locale aside, this is a major problem. What is the elegant & scientifically correct way of dealing with this in this environment? Are there any guidelines from your side?

One solution I can think of is forming locale based sub-forums so we can talk about these…

More follow-up:
After more usage, the dialects got better on average. But some of the source texts are unacceptable.
Nobody uses the name of a Serbian middle distance runner in everyday life. I can’t even imagine how I can pronounce this…

And nobody uses the Common Voice Sentence Collector. I wonder how these sentences ended up here?

A response to two of your points:

  • The sentences have foreign words because many of them are from SETimes, a public domain corpus of Turkish sentences. There are very few public domain sources of Turkish sentences, and we cannot use copyrighted sentences. So this was a way of bootstrapping the system. We would always welcome the addition of more and better sentences through the sentence collector.
  • The solution is to encourage more recordings from Beyaz Türkler :slight_smile: If people with regional variations predominate, it is because they are the ones contributing.

The issue with having people speak more slowly is an interesting one. I have been working with the Turkish data recently and have not found the same kind of issues that you have in this case. If you are interested in pulling out the sentences where people speak slower, you could try calculating the characters per second (transcript length / audio length) and sorting them by that.
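The characters-per-second heuristic mentioned above can be sketched like this — a minimal example, assuming you already have each clip’s transcript and audio duration from the release metadata; the sample records below are made up for illustration:

```python
# Rank clips by speaking rate so unnaturally slow readings surface first.
# Assumption: each record already carries its transcript and duration in
# seconds (e.g. joined from the release TSV and a duration lookup).

def chars_per_second(transcript: str, duration_s: float) -> float:
    """Speaking-rate proxy: transcript length divided by audio length."""
    return len(transcript) / duration_s

clips = [
    {"path": "clip_a.mp3", "sentence": "Bu bir deneme cümlesidir.", "duration_s": 2.0},
    {"path": "clip_b.mp3", "sentence": "Bu bir deneme cümlesidir.", "duration_s": 5.0},
]

# Ascending: lowest chars/sec (slowest speech) comes first.
ranked = sorted(clips, key=lambda c: chars_per_second(c["sentence"], c["duration_s"]))
for c in ranked:
    print(c["path"], round(chars_per_second(c["sentence"], c["duration_s"]), 2))
```

One could then listen only to the clips below some rate threshold instead of reviewing everything.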

About the correct question intonation, I do not know how to deal with that.

You can pronounce it like “Snejana Paykiç” I think.

Thank you for the explanation, Francis, now I understand. Indeed there is a problem with CC0 sources, as this license is rarely used in Turkey, except by some (like me). I spent a couple of hours searching for such sources to contribute to the Sentence Collector, and I did add some. If you have time, may I ask a few questions?

  1. I don’t know the process behind the Sentence Collector. If I enter new sentences, when will they come online?
  2. As suggested, I will collect everyday wording from the National Assembly’s meeting minutes — very tough work.
  3. There are some writers whose works have entered the public domain (75+ years have passed); can these be valid sources?

As I said, the data I hear is becoming acceptable in terms of bias. And I encourage my connections to contribute.
One big problem is the female/male ratio. I downloaded and analyzed the voice data in Excel. Here are some results:


As it is, I don’t think it is ready for any use…
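A tally like the one described above can also be done without Excel. Here is a minimal sketch; it assumes the TSV column layout of the Common Voice releases (client_id, path, sentence, up_votes, down_votes, age, gender, accent), and the inline sample merely stands in for the real `validated.tsv`:

```python
# Count speaker gender labels in a Common Voice validated.tsv-style file.
import csv
import io
from collections import Counter

# Inline stand-in for the real file; real data is read the same way.
sample_tsv = """client_id\tpath\tsentence\tup_votes\tdown_votes\tage\tgender\taccent
a1\tc1.mp3\tMerhaba.\t2\t0\tthirties\tmale\t
a2\tc2.mp3\tGünaydın.\t2\t0\tfifties\tmale\t
a3\tc3.mp3\tTeşekkürler.\t3\t1\ttwenties\tfemale\t
"""

with io.StringIO(sample_tsv) as fh:
    reader = csv.DictReader(fh, delimiter="\t")
    # Empty gender fields are common in the releases; bucket them explicitly.
    counts = Counter(row["gender"] or "unspecified" for row in reader)

print(counts)  # e.g. Counter({'male': 2, 'female': 1})
```

With the real file, the male/female ratio falls out directly from `counts`.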

What is the schedule for publishing new downloadable data?

Another problem is with the merge of “single word commands”, indicated as “Benchmark” in the dataset. I did not see/hear any of these after hundreds of sentences. The current data is very minimal. I submitted them back (with additions from Pete Warden’s dataset) through the Sentence Collector — I’m working on TinyML as my first use case. I don’t know if I did right though…


Thank you for your time and support…


After reading it a third time I recognized what you meant by this. With the phrase “people are racing”, I meant “a handful of people are racing for the ranking in the top contributors list”, the one on the bottom right of the dashboard.

Now I’m also among them :frowning: People must understand the importance of diversity here… With the current implementation, there is no tooling for diversity. You may want to encourage this openly and discourage the “top 10 list”. “Recruit a woman / your elders as volunteers and you may record 100 more” might work, for example :slight_smile:

Another issue I found is with the validity of the validated data. In the dataset I saw some recordings that got 2 upvotes and 1 downvote; I found some of them and listened. Some of them actually should not have been validated. Maybe the algorithm could be changed a bit: validate/reject only if the vote difference is at least 2, and re-queue the problematic ones for another review.

Currently, for the dataset to be useful, one would have to listen to all the recordings and handpick them before feeding them to the model. Well, that would defeat the purpose, of course…

Dear Bülent,

  1. The sentences in sentence collector are added when two people have validated them (pretty much like the clips). I realise this is a cumbersome process, but it is what we have for now.
  2. These books are fine, although they would have to be published by an author who died before 1951, and some of the language might be difficult for modern readers. In principle we could use books by Sabahattin Ali, since he was murdered in 1948 these should be in the public domain.

The schedule for new downloadable data is uncertain. The Common Voice team is on a break until the 15th March at least, but I expect a new version will be out at some point this year.

The data provided includes all clips and includes the up/down votes, so people can take their own approach to validating if they like.

I am on the Common Voice Matrix channel if you want to discuss in real time. This discourse system is not ideal for conversation-like interactions.
Best regards,
Francis M. Tyers


Thank you Francis, most of my questions are answered. The other problems must be solved in time with contributions to the data.

I was planning to start with Kürk Mantolu Madonna by Sabahattin Ali :heart:

And thanks to everyone who contributed/contributes to this wonderful project !
Bülent Özden


Extracted/selected/fixed/added ~2500 sentences from Kürk Mantolu Madonna to the Sentence Collector.

My old classmates also joined the cause, starting with verification.
