Making an open source captcha from Common Voice

What about making an open source captcha that would:

  1. Validate a read aloud sentence
  2. Allow a user to speak a sentence

There is a privacy issue with Google’s reCaptcha, and many website owners are looking for a less invasive system.

I guess a nice widget saying “Help us improve the web by participating in an open source engine for voice recognition” would add more value to the database (more users would be inclined to validate already submitted recordings).

This has multiple advantages: it’ll reduce the delay before sequences get validated and make more people aware of the project.
There is no need for microphone permission to listen and validate a sentence, so it can be very ergonomic.

The second step, where people submit recordings, is a bit harder: it requires the user to grant microphone access, plus some kind of automated speech recognition to check that a valid sentence is actually being spoken. It’s harder, but it could be offered only when the former validation (step 1) has failed a given number of times.

Please note that, in an office for example, one can expect a user to have headphones and be able to confirm that the spoken sentence is the right one, but having a microphone is less of a given.

In any case, I think this would breathe new life into the project.

Somebody already filed an issue here: https://github.com/mozilla/DeepSpeech/issues/2089 and we have been thinking about that for a long time as well.

Hi @X-Ryl669, welcome to the Common Voice community! :slight_smile:

Your idea is really interesting, and fits well into the field of “alternative ways to collect data”.

I’ve been reading the comments on GitHub and the idea is interesting: show the user 3 sentences to validate, 1 of which is already validated. If the user gets that one right, the captcha passes and we get 2 validations for free.
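A minimal sketch of that "1 known + 2 unknown" scheme, just to make it concrete. The clip records, function names and the idea of mixing known-bad clips into the known pool are my assumptions, not something taken from the GitHub issue or from the Common Voice API:

```python
import random
from typing import Dict, List, Optional, Tuple

# Hypothetical clip record: (clip_id, known_answer).
# known_answer is True/False for clips that already have a verdict,
# and None for clips still waiting for validation.
Clip = Tuple[str, Optional[bool]]


def build_challenge(known: List[Clip], pending: List[Clip],
                    rng: random.Random) -> List[Clip]:
    """Pick 1 already-validated clip and 2 pending ones, in random order."""
    challenge = [rng.choice(known)] + rng.sample(pending, 2)
    rng.shuffle(challenge)  # the user must not be able to spot the control clip
    return challenge


def grade(challenge: List[Clip],
          answers: Dict[str, bool]) -> Tuple[bool, Dict[str, bool]]:
    """Return (captcha_passed, votes_collected_for_pending_clips)."""
    passed = all(answers[cid] == truth
                 for cid, truth in challenge if truth is not None)
    # Only keep the two "free" validations when the control clip was answered correctly.
    votes = ({cid: answers[cid] for cid, truth in challenge if truth is None}
             if passed else {})
    return passed, votes
```

If the known pool contains both valid and invalid clips, a bot that always clicks Yes fails whenever the control clip is an invalid one.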

Do we have documentation about how other captchas work? If we are talking about a Yes/No question, someone can always click Yes and have a 50/50 chance to bypass the captcha, so I don’t really know how to improve the efficiency.

See how others have been using the API to build external tools; maybe someone wants to play with it to create a test.

Thanks for your feedback!

I was thinking more of asking the user to type the text they’ve heard. If it’s close to the expected text, accept the captcha. If it isn’t, mark the original clip as dubious and ask the user to validate a sequence that is already known to be valid.

If a sequence is marked dubious more than once, reject it and accept the captcha for the user.

From the user’s perspective, since most clips are valid, they’ll have to validate a single sequence instead of 3. If the sequence is invalid (bad luck), they’ll have to validate 2 sequences.

The fact that the process is not known beforehand (the number of steps changes pseudo-randomly) will likely defeat bots but shouldn’t be too painful for a human.
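Roughly, the flow described above could look like this. It is only a sketch: the similarity measure (Python's `difflib`), the 0.8 threshold and all the names are assumptions I picked for illustration and would need tuning on real data.

```python
from difflib import SequenceMatcher
from typing import Dict

SIMILARITY_THRESHOLD = 0.8  # assumed value, to be tuned
MAX_DUBIOUS_MARKS = 2       # "marked dubious more than once" => reject on the 2nd mark


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and collapse whitespace."""
    kept = "".join(c for c in text.lower() if c.isalnum() or c.isspace())
    return " ".join(kept.split())


def is_close(typed: str, expected: str) -> bool:
    ratio = SequenceMatcher(None, normalize(typed), normalize(expected)).ratio()
    return ratio >= SIMILARITY_THRESHOLD


def check_transcription(clip_id: str, typed: str, expected: str,
                        dubious_counts: Dict[str, int]) -> str:
    """Return 'pass', 'retry_with_known_clip' or 'pass_and_reject_clip'."""
    if is_close(typed, expected):
        return "pass"
    # Mismatch: blame the clip first, not the user.
    dubious_counts[clip_id] = dubious_counts.get(clip_id, 0) + 1
    if dubious_counts[clip_id] >= MAX_DUBIOUS_MARKS:
        return "pass_and_reject_clip"
    return "retry_with_known_clip"
```

On `retry_with_known_clip` the caller would serve a clip that is already known to be valid and run `is_close` against it to decide the captcha.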

We could also present a single sequence and ask the user whether what they heard is that sentence (a Yes / No answer). However, 50% of spam bots would pass such a scheme, so it has to be coupled with some kind of behavior analysis, so that the “limited” version is only shown when we are already fairly sure that the user is human.
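To put rough numbers on the Yes / No problem, here is a tiny calculation. It assumes the correct answers are split evenly between Yes and No and that the bot guesses each question independently, which is a simplification:

```python
def blind_guess_pass_rate(num_questions: int) -> float:
    """Chance that random guessing gets every Yes/No question right."""
    return 0.5 ** num_questions


for n in (1, 2, 3, 5):
    print(f"{n} question(s): {blind_guess_pass_rate(n):.1%} of guessing bots pass")
```

So stacking questions helps, but each extra question also costs the human more time, which is why the single-question version only makes sense with some other signal on top.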

Interesting, are there studies, or known experts we can reach out to, that could provide quick input on these ideas? :smiley:

Some thoughts on this:

  1. I’ve been thinking for a while that validations should probably be weighted in certain situations. So given that there will likely have to be some kind of fuzzy logic to match the sentences, each captcha validation should probably count as 0.5 votes rather than a whole vote to counteract the potential drop in validation accuracy.

  2. This method of validation only asks the user to transcribe the spoken audio. It does not provide for other situations in which a clip may be rejected, such as background voices or using text-to-speech.

  3. The algorithm needs to serve up simpler sentences - i.e. relatively short sentences with words in common usage. This means it will not be suitable for validating every sentence in the validation queue.

  4. The language should be configurable by the user. Or maybe the captcha script could detect the user’s language automatically.

  5. Three short sentences seem fine on desktop devices but will probably be quite tedious on mobile.
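For point 1, a weighted tally could be sketched like this. Only the 0.5 weight for captcha votes comes from the idea above; the weight of 1.0 for regular votes, the score of 2.0 needed for a verdict and the names are assumptions, not how Common Voice actually counts votes:

```python
from typing import List, Tuple

VOTE_WEIGHTS = {"site": 1.0, "captcha": 0.5}  # 0.5 as suggested above, 1.0 assumed
REQUIRED_SCORE = 2.0                          # assumed threshold for a verdict


def tally(votes: List[Tuple[str, bool]]) -> str:
    """votes is a list of (source, is_valid); returns 'valid', 'invalid' or 'pending'."""
    up = sum(VOTE_WEIGHTS[source] for source, ok in votes if ok)
    down = sum(VOTE_WEIGHTS[source] for source, ok in votes if not ok)
    if up >= REQUIRED_SCORE and up > down:
        return "valid"
    if down >= REQUIRED_SCORE and down > up:
        return "invalid"
    return "pending"
```

With these numbers a clip needs four agreeing captcha votes to reach the same verdict as two regular votes.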

Maybe typing the sentence is the wrong way to go about it? Perhaps the user could listen to the sentence once, then a list of four possible sentences appears which they can choose from. These sentences should be similar with only minor variations (e.g. randomly changing “a” to “the” or using the language model to add additional words).

This encourages them to listen carefully to it and disadvantages people trying to break the captcha with speech-to-text engines which may not manage nuance well. Then there would be an “other” option if the words the user spoke are not on the list (which indicates that they probably misspoke and the clip should be rejected).
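A rough sketch of how the answer list could be built. The swap table is a toy stand-in for the "minor variations" (a real version might use the language model instead), and every name here is made up for illustration:

```python
import random
from typing import List

# Toy table of near-identical word swaps (assumption, not a real resource).
SWAPS = {"a": "the", "the": "a", "is": "was", "was": "is",
         "this": "that", "that": "this"}
OTHER = "None of these / something else was said"


def single_swap_variants(sentence: str) -> List[str]:
    """All sentences that differ from the original by exactly one swapped word."""
    words = sentence.split()
    variants = []
    for i, word in enumerate(words):
        replacement = SWAPS.get(word.lower())  # words with punctuation attached are skipped
        if replacement is None:
            continue
        if word[0].isupper():
            replacement = replacement.capitalize()
        variants.append(" ".join(words[:i] + [replacement] + words[i + 1:]))
    return variants


def build_options(sentence: str, rng: random.Random,
                  n_distractors: int = 3) -> List[str]:
    """Original sentence plus up to n_distractors variants, shuffled, with 'other' last."""
    distractors = single_swap_variants(sentence)
    rng.shuffle(distractors)
    options = [sentence] + distractors[:n_distractors]
    rng.shuffle(options)
    return options + [OTHER]
```

Sentences with fewer than three swappable words yield fewer distractors, so the serving logic would have to skip them or fall back to another kind of variation.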

You’re right, your method is easier and simpler for the user.
The more I think about it, the more I realize that a solution for funding/helping open source algorithms is really missing.

For such a captcha, there are 4 distinct methods here:

  1. Graphical: I was thinking about using comics, removing a speech bubble and offering 4 choices for the missing bubble. I expect this to be very hard to solve even for AI, because it requires OCR, then NLP, then semantic parsing, and then working out whether each option is funny/profound and which one is the funniest. The ethical advantage is that it makes people aware of some webcomic authors (but it does not help algorithm implementors).
  2. Audio, as we are discussing. Advantage: it helps Mozilla and, hopefully, all of us get a truly open source and efficient speech recognition engine.
  3. Text: ask questions based on common sense, but not obvious ones (nothing Google can answer easily), like: “If Bob and Tom are in the same class, and Tom is 6 years old, how old is Bob?”. This is a fallback for text-only browsers.
  4. Not a bot: have JavaScript create 2 random checkboxes (I’m a robot / I’m a human) and let the user click one of them. Most bots don’t run JavaScript, and those that do have no reason to pick one box over the other.

I really like the idea of creating a captcha that benefits Common Voice. However let me demonstrate an attack:

I am a spam bot and want to pass the Captcha without being human. I know that the Captcha is based on CV, so I downloaded the most recent version of the dataset beforehand. I hashed every mp3 file with MD5 and stored the assigned text in a big hash table. Now, when a captcha wants me to transcribe or select a sentence, I just make a hash table lookup and enter the result. BAM! Now I am human.

The only way to prevent this is to feed the captchas with data that has not yet been published. That on the other hand decreases the amount of usable data. It would also mean that Mozilla could only release data that is at least x months old.
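The attack really is only a few lines. A sketch with made-up byte strings standing in for the mp3 files (the function names are mine):

```python
import hashlib
from typing import Dict, Iterable, Optional, Tuple


def build_lookup(dataset: Iterable[Tuple[bytes, str]]) -> Dict[str, str]:
    """Map the MD5 of every published clip to its sentence."""
    return {hashlib.md5(audio).hexdigest(): sentence
            for audio, sentence in dataset}


def solve(lookup: Dict[str, str], served_audio: bytes) -> Optional[str]:
    """Return the sentence if the served clip is byte-identical to a published one."""
    return lookup.get(hashlib.md5(served_audio).hexdigest())


published = [(b"fake-mp3-bytes-1", "The cat sat on the mat."),
             (b"fake-mp3-bytes-2", "It is raining today.")]
lookup = build_lookup(published)

print(solve(lookup, b"fake-mp3-bytes-1"))    # published clip: answer found
print(solve(lookup, b"unpublished-clip"))    # unpublished clip: None
```

The second call is the mitigation in action: a clip that is not in the downloaded dataset gives the bot nothing to look up.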

Another aspect that should be considered: Currently, the bottleneck does not seem to be validation but recording. How would a captcha work that is based on voice recording?

When the Deep Speech pretrained models get good enough, we could aim to give each recording some sort of score and apply a confidence threshold. I really think a captcha would be perfect for getting Common Voice into the mainstream, since people browse a lot and encounter a lot of captchas.
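Something along these lines, as a sketch only. The `transcribe` callback is a placeholder for whatever engine is used and is assumed to return a transcript plus a confidence between 0 and 1; it is not the actual Deep Speech API, and both thresholds are invented:

```python
from difflib import SequenceMatcher
from typing import Callable, Tuple


def accept_recording(audio: bytes, prompt: str,
                     transcribe: Callable[[bytes], Tuple[str, float]],
                     min_confidence: float = 0.7,
                     min_similarity: float = 0.8) -> bool:
    """Accept when the engine is confident and the transcript matches the prompt."""
    text, confidence = transcribe(audio)
    similarity = SequenceMatcher(None, text.lower(), prompt.lower()).ratio()
    return confidence >= min_confidence and similarity >= min_similarity
```

Recordings that pass would solve the captcha; whether they should also count as pre-validated for the dataset is a separate question.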