Wednesday, May 23, 2007

reCAPTCHA: A new way to fight spam

You've probably seen a CAPTCHA before: those funky letters you have to enter before you sign up for an account on almost any website. I'm proud to announce a new type of CAPTCHA: reCAPTCHA (click to see a live demo!).

You might notice that reCAPTCHA has two words. Why? reCAPTCHA is more than a CAPTCHA: it also helps to digitize old books. One of the words in a reCAPTCHA is a word the computer already knows, much like in a normal CAPTCHA. The other is a word that the computer can't read. When you solve a reCAPTCHA, we not only check that you are a human, but also use your answer for the other word to help read the book!
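
As a rough illustration of the two-word check described above, here is a minimal sketch. It is not reCAPTCHA's actual code: the function name `grade_answer`, the normalization, and the assumption that the answer arrives as two whitespace-separated words are all hypothetical.

```python
def normalize(word):
    """Lowercase and strip whitespace so trivial differences are ignored."""
    return word.strip().lower()


def grade_answer(answer, known_word, known_index):
    """Grade a two-word answer (hypothetical helper, for illustration).

    answer      -- the text the user typed, two words separated by whitespace
    known_word  -- the word whose correct reading is already known
    known_index -- 0 or 1: position of the known word in the challenge

    Returns (is_human, unknown_guess). unknown_guess is None when the
    known word was wrong, because the other word cannot be trusted then.
    """
    words = answer.split()
    if len(words) != 2:
        return False, None
    if normalize(words[known_index]) != normalize(known_word):
        return False, None
    # The user passed the test; keep their reading of the unknown word.
    return True, normalize(words[1 - known_index])
```

Only the known word decides whether the user gets in; the reading of the other word is simply collected for later use.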

Luis von Ahn and myself estimated that about 60 million CAPTCHAs are solved every day. Assuming that each CAPTCHA takes 10 seconds to solve, this is over 160,000 human hours per day (that's about 19 years). Harnessing even a fraction of this time for reading books will greatly help efforts to digitize books.
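
The arithmetic behind that estimate can be checked in a few lines. The two inputs come from the paragraph above; the variable names are just for illustration.

```python
CAPTCHAS_PER_DAY = 60_000_000   # estimate quoted in the post
SECONDS_PER_CAPTCHA = 10        # assumed solving time quoted in the post

total_seconds = CAPTCHAS_PER_DAY * SECONDS_PER_CAPTCHA
human_hours = total_seconds / 3600        # seconds -> hours
human_years = human_hours / 24 / 365      # hours -> days -> years

print(f"{human_hours:,.0f} human hours per day")
print(f"{human_years:.1f} years of effort per day")
```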

reCAPTCHA provides an easy-to-use API for putting CAPTCHAs on your site. Installing is as easy as adding a few lines of code to your HTML and then making an HTTP POST request to our servers to verify the solution. We also wrote plugins for WordPress, MediaWiki, and phpBB to make it very easy to integrate.
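
To give a feel for the server-side half of that flow, here is a minimal sketch of a verification call. Everything specific in it is an assumption made for illustration, not the documented API: the URL `https://captcha.example/verify`, the form field names, and the reply format (a first line of `true` or `false`). Check the real API documentation before using anything like this.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint, for illustration only.
VERIFY_URL = "https://captcha.example/verify"


def build_verify_body(private_key, remote_ip, challenge, response):
    """Encode the form fields for the POST request (field names assumed)."""
    return urlencode({
        "privatekey": private_key,
        "remoteip": remote_ip,
        "challenge": challenge,
        "response": response,
    }).encode("utf-8")


def parse_verify_reply(text):
    """Return True if the first line of the reply is 'true' (format assumed)."""
    lines = text.splitlines()
    return bool(lines) and lines[0].strip() == "true"


def verify(private_key, remote_ip, challenge, response, url=VERIFY_URL):
    """POST the user's answer to the verification server and parse the reply."""
    body = build_verify_body(private_key, remote_ip, challenge, response)
    # Passing data= makes urlopen issue a POST request.
    with urlopen(url, data=body, timeout=10) as reply:
        return parse_verify_reply(reply.read().decode("utf-8"))
```

Your form handler would call `verify(...)` with the values submitted by the browser and reject the submission when it returns `False`.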

One other interesting service reCAPTCHA provides is a way to securely obfuscate emails. Many sites display emails like bmaurer [at] foo [dot] com or use hacks with tables, JavaScript or encodings to get the same effect. Spammers are getting smarter and figuring out these tricks. Spammers are especially diligent about working around the strategies of well-known open source software. Consider this warning on bugzilla.mozilla.org:

Although steps are taken to hide addresses from email harvesters, the spammers are continually getting better technology and it is almost guaranteed that the address you use with Bugzilla will get spam.

reCAPTCHA Mailhide provides a scalable solution to email obfuscation that can be widely deployed without being breakable. Mailhide provides a way to encrypt a user's email with a key only reCAPTCHA knows. reCAPTCHA will only display the email address when the user solves a CAPTCHA. With reCAPTCHA, I can display my email address as bmau...@andrew.cmu.edu. If you click on the three dots and solve a CAPTCHA, you can see my address. Mailhide provides a way for individual users to encode their email address as well as an API for services (like Bugzilla) to share an encryption key with reCAPTCHA.
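
The visible, truncated form of an address can be produced with a few lines of code. This sketch covers only the display side; the encryption and the CAPTCHA gate are not shown. The function name `mask_email` and the rule of showing the first four characters (or just one for short names) are assumptions for illustration, not Mailhide's documented behavior.

```python
def mask_email(address, visible=4):
    """Return a display form such as 'bmau...@andrew.cmu.edu'.

    Hypothetical helper: shows at most `visible` leading characters of
    the local part, and only one character when the local part is short.
    """
    local, _, domain = address.partition("@")
    if not local or not domain:
        raise ValueError("not an email address: %r" % address)
    shown = local[:visible] if len(local) > visible else local[:1]
    return f"{shown}...@{domain}"


print(mask_email("bmaurer@andrew.cmu.edu"))
```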

If you're suffering problems with spam, take a look at reCAPTCHA. Not only can you solve your problems with spam, you can help preserve mankind's written history into the digital age!

74 comments:

Anonymous said...

This is a brilliant idea, excellent work!

Alan said...

Is this in conjunction with Project Gutenberg?

Anonymous said...

so if you misread the first word (from the book) and get the second one (the captcha) right, the book will be transcribed incorrectly?

not much safer than ocr is it?

Ben Maurer said...

We're working with archive.org to get books for right now.

Ben Maurer said...

We use a voting system to avoid human typos (or an evil person who wanted to have books read "shit shit shit"). We're still tuning this system to get good accuracy while using humans optimally.
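
As a rough illustration of how such voting could work, here is a sketch under assumed thresholds. It is not the tuned algorithm mentioned above; the function name `accepted_reading` and the `min_votes` and `min_share` parameters are hypothetical.

```python
from collections import Counter


def accepted_reading(guesses, min_votes=3, min_share=0.6):
    """Return the agreed reading of a word, or None if there is no consensus.

    guesses   -- readings typed by different users for the same word image
    min_votes -- minimum number of matching answers required (assumed)
    min_share -- minimum fraction of all answers that must match (assumed)
    """
    counts = Counter(g.strip().lower() for g in guesses if g.strip())
    if not counts:
        return None
    word, votes = counts.most_common(1)[0]
    total = sum(counts.values())
    if votes >= min_votes and votes / total >= min_share:
        return word
    return None  # keep showing the word to more people
```

With these thresholds, a single typo or prank answer cannot decide a word on its own, and a word without clear agreement simply stays in circulation.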

Hendrik said...

That's as brilliant as Google's Image Labeler, though I hope that reCAPTCHA's results are made public so they can be useful for everybody, not only to a single company.

lazka said...

what about just switching the two words randomly. this should at least give a few percents more accuracy.

Ben Maurer said...

Google's Image Labeler was invented by Luis von Ahn, my adviser on reCAPTCHA. (ESP Game is the pre-Google name.)

Anonymous said...

That's genuinely brilliant. It's microtransaction karma!

Quick question though, does the API allow for it to be restyled? I'd be uberhappy to use this as my captcha solution in the future, but explaining/selling clients on what reCaptcha is and why their logo is on their site might be a little difficult...

Anonymous said...

This is the coolest project I've seen in a month. Great stuff.

Thomas said...

This is a really good idea, excellent work :)

Ben Maurer said...

Right now, the API allows for a limited number of styles, all of which include the reCAPTCHA logo.

If you contact us at support@recaptcha.net we may be able to work out a branding-free option.

Ploum said...

But captchas are, at best, annoying! It's a shame that people who want to post a comment must pass this painful step. And I don't speak about people with difficulties.

Really, captchas must be avoided. And there are plenty of ways:
http://ploum.frimouvy.org/?150-the-invisible-captcha-mechanism-icm-against-form-spam

Ploum said...

The complete URL: http://tinyurl.com/3arrjc

(and one more captcha for me)

Ben Maurer said...

reCAPTCHA includes audio CAPTCHAs; we believe accessibility is extremely important (I agree, it's a shame most sites with CAPTCHAs aren't accessible).

While CAPTCHAs can be annoying, JavaScript-based "CAPTCHAs" don't really work. Sure, if you implement them on your site by yourself, it might be fine. But if N people use it, it will be worth figuring out how to break. The same applies to "1+1=?" tests.

Anyways, none of those solutions allow you to contribute to the effort to digitize books :-).

Anonymous said...

It's not really a new idea - it's been documented that some evil sites forward the captcha of other sites so that a human will crack the code instead of having to try to OCR it.

Anonymous said...

I might sound obtuse but I think it should be mentioned somewhere that the two words should be space separated (or comma separated).

Sardak said...

Great idea! Harness the power of the masses. There are a good number of analogies to be drawn between renewable energy and internet usage.

Anonymous said...

What if it's a rude word?

aizatto said...

Haha, brilliant, I like it. But the only side effect I see, is that if the computer can now read the captcha, how do we block those computers when we want to prevent bots. :P

knocte said...

We use a voting system to avoid human typos

Why not use a double check to confirm the word is correct? I mean, show the word to two people: if the typed word is the same, insert it in the database for OCR, but if there's a mismatch, wait until there are two answers that are the same.

what about just switching the two words randomly. this should at least give a few percents more accuracy.

That's a nice improvement also.

Donkey said...

Nice idea, but now that we have a technology that learns to read captchas... couldn't it be used to override captchas???

Mark said...

Ben, bad example: your email address is linked from here, with no protection...

http://www.blogger.com/profile/00743319148021355050

The only way to avoid being sent spam on an email is to never use it :-)

Russ Jones said...

So, let me get this right, if 1 web spammer writes the 1 algorithm that works on reCAPTCHA, everyone is screwed?

For a web service to work, the captcha needs to (1) not be word dependent (vulnerable to dictionary-based cracks) and (2) be frequently changing.

No thank you reCAPTCHA, in solving the problem with scanning books, you have unsolved the problem of spam.

Anonymous said...

Russ: If 1 spammer writes a program to defeat this CAPTCHA, then: (1) That program can be used to improve *actual* OCR of old books, and thus can help the world. (2) It seems easy to change the distortions used since this is a web service. This is even better than if everybody installed some CAPTCHA that was not a web service -- if 1 spammer wrote 1 program that could hack it, everybody would have to patch their blog.

Kudos to reCAPTCHA.

davejuk said...

What's to stop a spammer advertising free pron or something, but before a user gets access, they need to solve a captcha?

When the captcha is loaded, it's retrieved from a sign-up page for a site the spammer wants to abuse. As soon as the user solves it, the spammer's application has the key it needs to complete registration and voila, they have a new account to start spamming with.

Meanwhile, the user gets access to their free pron and is none the wiser.

Alper said...

I watched your adviser Luis von Ahn's Google Talk. It was amazing.
Most of the comments here are focusing on the drawbacks of this method or saying it does not stop spam. Spam will evolve as security methods evolve; it cannot be stopped completely. But Luis von Ahn and Ben Maurer are working on fighting spam and harnessing human computation. It was a brilliant way to use captchas for image labeling and it's getting more brilliant to use them as OCR.

Anonymous said...

So let's see if I understand...

In order to access a page protected by reCAPTCHA, I must not only provide a word to prove I am human, I must also provide some unpaid labor?

If this idea takes off I see a whole bunch of applications:

- reMOWA
You enter one word, then mow my lawn before I'll let you in

- reWASHA
You enter one word, then wash my wife's car

- reHOMEWORKA
You enter one word, then solve an exam problem for needy 5th graders

- reSETIA
You enter one word, then help the search for Alien life

- reMMORPGA
You enter one word, then must increase your power by one level

the possibilities are endless!

DocJeff said...

CAPTCHAs are not very kind to the visually disabled. I'm totally colourblind (think greyscale but applied to vision) and some of the CAPTCHAs, perhaps 80%, are impossible for me to see.

Some places have started adding audio playback of the CAPTCHA but even that is occasionally difficult to hear.

Anonymous said...

It seems reCAPTCHA does not rely on color vision, and has an audio version.

Ben Maurer said...

About the porn weakness: yes, we have this to some extent.

However, we are already working on ways to detect when this is done. Our API design included this requirement (it is the only reason you have to register: to make it a bit harder to do the porn thing). We're going to work on detecting patterns that might indicate an attempt to use porn to break our CAPTCHA.

Anonymous said...

an ingenious way to put unused resources to work (reminds me of the seti project) -- fantastic idea! i wish you great success with the project..

Josh S. said...

Great idea and seems to be implemented well. Good work!

(Now you just need to figure out how to use it for comments on your blog!)

Tanel said...

installed email protection on my website :)

Anonymous said...

you will have to switch the two words randomly, otherwise people will know that they need to type the captcha first and the second word is not relevant. people are lazy and will type "[captcha] blabla" all the time if they know that works

Anonymous said...

Awesome idea! For wasting a sh*tload of everyone's time! Thanks for putting more roadblocks and speedbumps on the web!

Anonymous said...

Ben Maurer, this is a brilliant idea- it's simple and solves a useful and real problem.

I look forward to seeing this in action!

agloco said...

sounds very interesting.. so how can we implement this?

Inboulder said...

Brilliant!!
I notice you're not using it on your blog though

Anonymous said...

it seems it's switching now.. cool

Brad GNUberg said...

This fricken rocks. Brilliant idea and execution.

Best,
Brad Neuberg
http://codinginparadise.org

Anonymous said...

*Luis and I*! (Not "myself".)

peter said...

I'd love to use it ... however on that one particular site I have in mind the design asks for 2 comment forms, one on top of the comment list and one at the bottom ... unfortunately reCaptcha cannot deal with that situation and there seem to be no plans in place when this issue would be dealt with... too bad

Anonymous said...

An awesome way to apply the "human computation" concept. Congrats.

You can also use Contactify to avoid revealing your email address.

JS said...

In other words, reCAPTCHA misuses users to do work for you: work you would otherwise need to pay for. Not that great an idea after all.

Sean McManus said...

It's a nice idea, but it needs to be opt-in. People hate captchas as it is (they often involve repeated attempts, and they're inaccessible to those using assistive devices). Forcing people to do two, one of which is irrelevant to the task at hand, is going to annoy people.

Users who don't like the idea can too easily spoof it. You can type in 'donkey' for the book word (whatever it is) and as long as you get the other one right, the system says 'correct' and lets you in. One of the words I got was illegible and might not even be a scan of a complete word. That was the book word it turned out, so it didn't matter that I typed in nonsense for it. There's a good chance that the number of people who enter 'crap' for the book word will be greater than those who enter the correct word, which means you can't even go with the majority.

Actually, I've just had to enter the captcha here twice to get it to post. Captchas are just broken.

Sean McManus said...

I should add that it's often obvious which word is the book word because it is a proper noun or an antiquated word, for example.

Pensador said...

Brilliant idea! Congratulations! I shall integrate it to my custom blog system.

Wynand said...

Wow. This is a wonderful idea.

I both agree and disagree with Sean.

I think it would be possible to make it very hard to distinguish between the book word and the computer word.

On the other hand, it would be nice to allow people to opt out (and to present them with normal captchas) by supplying a button which will point out the computer word. This at least provides some goodwill and I doubt most people would opt out anyway.

Again, this idea is so marvelous because it is so simple.

Anonymous said...

I'm really amazed at the amount of negativity many of these comments display toward a really brilliant idea. I run a forum, and I can tell you from first-hand experience that automated spambots have made keeping a forum clean nearly impossible. Many forums running phpbb have become completely overrun with thousands of spambot registrations and tens of thousands of spam postings. Running a forum becomes a daily exercise in cleaning out lots of spam accounts.

To all the people who don't like CAPTCHAs: Tough Noogies. Do you want to be inundated (and have to read) endless streams of spam? Probably not. The only thing that has even a remote chance of reducing spam is a test of the human-ness of users. Using that otherwise wasted effort for the betterment of humanity is brilliant and very worthy of recognition. Complaining about it because the users are unpaid misses the point that they'd be unpaid regardless of whether there was a benefit to their effort.

I've replaced the stock phpbb CAPTCHA on two of my forums with the reCAPTCHA service. So far, so good.

Gayle said...

This is awesome. Do you have any examples about how to use with a C#.NET app? I'd add this in a second if so.

Anonymous said...

but most people won't know they are involved in something greater than a two word CAPTCHA unless they click on an about link or something similar. which they won't. they'll just think it's a slightly altered CAPTCHA. the number of different systems we've seen cropping up is quite gigantic, so what's to differentiate this one to the layman?

grepper said...

I don't know where people get the idea of a "book" word vs. a "computer word"

It uses "known" and "unknown" words. The known word could just be a word it had previously shown to enough people to "know" what it is.

Assuming it randomly chooses which comes first, you couldn't be lazy and "just type adsfdsaf" for the "book word" because there is no way to know which is which.

As for captchas being annoying and having to do two... The ones I tried all seemed more legible than those on ticketmaster, and I'm not color blind. Ticketmaster's computer generated noise is often completely unreadable.

You should try to switch ticketmaster to using your system. Have them do 3 or 4 words if ticketmaster wants more protection. It would get you a lot translated, and it would make it _easier_ for people to buy tickets.

Julien Couvreur said...

This is a great system. All three parties (the website, the user and reCaptcha) benefit.

My main concern is that the security of this CAPTCHA is effectively half of a normal CAPTCHA of the same length, or from the user perspective, it's twice as much "labor" as usual websites.

Finally, have you considered using random letters, rather than a valid dictionary word, for the "known word" part of the challenge? This would probably make the CAPTCHA more secure.

Julien Couvreur said...

Forgot one thing: do you cross-validate the answers from multiple users?

Also, you might consider alternating the sequence for the "known" and "unknown" words. It seems that the current system always puts the "known" word first. Users could figure this out and get in with half the "labor".

nitesh said...

wonderful work.

Anonymous said...

That's amazing. Whoever came up with it deserves a cookie and a ton of mangoes.

One concern though. When I was doing the samples, I came across a word with a ü. I suspect my version would get lost in the voting process because a lot of people don't know how to type ü. (Yay for compose keys!)

Perhaps it would be worth addressing?

RK said...

Hey..suggest it to Wordpress.com guys..and blogspot guys...
Good idea.

Motorcycle Guy said...

This is a pretty awesome idea. Kind of like Google's image labeler, but I think the part about spreading it across any website is pretty brilliant. Google could probably do that with their image labeler as well.

Anonymous said...

"Chinese radio scare alert: these people want to exploit your brainpower with their captcha tricks! It's like enslaving humanity, one word at a time!"

You certainly get extra points for originality of your idea.

I'm sure nobody will get my Chinese radio reference though...

Anonymous said...

I don't see how a public/private key can be used to solve the porn-problem (social engineering problem). Care to shed some light on that?

Damien Jorgensen UK said...

Great little idea

Domas Mituzas said...

thumbs up.
how many captcha auths/s can you handle? :)

Ben Maurer said...

With our current setup (4 servers), I think I'd be very comfortable doing 200 auth/s average (obviously with a higher peak). We can do more, however I am accounting for potential server failures or data center outages.

Regardless, I think we have more than enough room for growth. Also, our stuff scales linearly by adding more computers to the mix. Worst case: we need to order a computer.

Anonymous said...

It is a laudable project, but I still think that CAPTCHAs are a fundamentally flawed concept.

I don't think the audio CAPTCHA is terribly accessible, and even if this is developed further, it still excludes deafblind users.

Whilst deafblind people might not compose the biggest demographic group in the world, we're still talking about tens of thousands - if not hundreds of thousands - of people. And dare I suggest that the web is more important to those users than almost any other group...

Anonymous said...

Please stop bumping this post. It keeps re-appearing in Planet aggregators, which is rude and akin to reposting the same content over and over just to get some more exposure.

Andrew Parker said...

I find it ironic that I have to fill out Google's old school CAPTCHA in order to comment on this post.

GREAT JOB! What a cool idea!

Ben Maurer said...

I'm sorry about bumping the post. I had to make a few edits, and it looks like planet is looking at when I updated the post. Complain to the developers ;-).

Anonymous said...

Where can I download a C# version of the api?

Anonymous said...

I had to deactivate this because it wasn't letting any comments through and didn't alert users that their comment did not go through - that should be worked on.
Just a question - Why arean't you using it for your comment system?

Ben Maurer said...

>I had to deactivate this because it
>wasn't letting any comments through and
>didn't alert users that their comment did
>not go through - that should be worked on.

Could you give some more details?


>Just a question - Why arean't you using
>it for your comment system?

Blogger doesn't let me ;-(.

Jody+ said...

This is a great idea. I downloaded the WordPress plugin last night, however, and this morning one of my commenters encountered an error in IE. I duplicated it in IE7...when a person goes to the comment page they are told that "internet explorer cannot open the internet site ______ operation aborted" after which the page goes to a blank page.

I read that this error is often triggered by JavaScript and tables in a particular order in the body section, but my theme uses divs, not tables. Any thoughts about what the problem may be? I'd love to use this but I can't until I find a solution...

Jaaky said...

reCaptcha is cool. I use it on my site.

Catgofire said...

I like it, but I'm doing a site that can't require JavaScript (don't ask me, ask Internet Explorer)... that solution is kind of cumbersome.