What's wrong with image CAPTCHAs?


(Examples follow on page 2)

You know those twisted pictures of letters and numbers you have to type in on web sites? They are supposed to be a barrier to spam robots (spambots). Two things are wrong with them: 1) they cause accessibility problems, and 2) they have been cracked.


Accessibility

The current solution shuts out people with impaired sight, a large group that includes many elderly people, and excludes blind people altogether. One workaround is to provide an audio version of the CAPTCHA code, which many sites have now introduced.


Advanced OCR has cracked many image CAPTCHAs

Spammers have managed to crack image CAPTCHAs with advanced character recognition methods, with success rates of over 30% on some big sites. They create hundreds of free email accounts automatically, bypassing the CAPTCHAs, and then use the accounts to send thousands of spam mails. So the CAPTCHAs let the spambots through while very likely keeping out some humans.


Smarter CAPTCHAs requiring real-world knowledge

Why not change the test and make it smarter? Instead of a twisted image to fool OCR systems, let's use plain text questions which can only be answered with real-world knowledge. A spam robot would then fail the Turing test, and no amount of OCR advancement will change that. Only advances in linguistic artificial intelligence might, and when the questions are carefully chosen that is a long way off. I decided to try out a linguistic approach. Obviously the questions need to be simple enough for most humans to answer easily, yet hard enough to block spambots, which don't know how to answer.


Guidelines


The main thing to keep in mind is:


Use no approach which can be automated


That sounds a bit odd given that we are using software and the goal of a programmer is to automate tasks. However, automation has been the downfall of all anti-spam methods so far. If you can create the anti-spam barrier with software, then other software can be written to circumvent it. So instead of CAPTCHAs, which are generated automatically, we would use questions created by humans: a linguistic Turing test. We may need to change the "Completely Automated" in CAPTCHA to "Partially Automated", giving PAPTCHA. The questions are created manually; only the actual challenge and response are automated.
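To make the split between manual and automated work concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the three sample questions, the function names (pick_challenge, normalize, check_response) and the normalization rules. Only the choice of question and the comparison of the reply are automated; the questions themselves are written by hand.

```python
import random
import string

# Hypothetical hand-written questions, each mapped to its accepted answers.
# A real site would write its own questions and keep them private.
QUESTIONS = {
    "What colour is the sky on a clear day?": {"blue"},
    "What do bees make that people eat?": {"honey"},
    "What is the frozen form of water called?": {"ice"},
}


def normalize(text):
    """Drop punctuation and surrounding whitespace, then lowercase."""
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table).strip().lower()


def pick_challenge(rng=random):
    """Automated step 1: choose one of the hand-written questions at random."""
    return rng.choice(list(QUESTIONS))


def check_response(question, response):
    """Automated step 2: compare the visitor's reply with the accepted answers."""
    accepted = QUESTIONS.get(question, set())
    return normalize(response) in accepted
```

A form handler would call pick_challenge() when rendering the page, remember which question it asked (for example in the session), and call check_response() on submission. Normalizing the reply is a design choice so that " Blue. " and "blue" are treated alike.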


* No questions which contain the answer


Multiple choice is easy to crack: a bot simply enters each word of the question in turn (see next page).


* No arithmetical questions because mathematics can be automated


Questions like "What is two plus four?" are far too easy for a program to solve.


* The real challenge, creating the questions, is a linguistic task, not a software engineering task


Success depends on how good the questions are. In contrast, the programming code is fairly trivial.
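The first two guidelines can be checked mechanically while a person is writing questions. The sketch below, in Python, is one possible helper and not part of any existing tool: the function names (words, lint_question) and the ARITHMETIC_WORDS list are assumptions, and the checks are rough heuristics. It only warns about weak questions; writing good ones remains a human job.

```python
import re

# Words that suggest an arithmetic question (an assumed, incomplete list).
ARITHMETIC_WORDS = {"plus", "minus", "times", "add", "subtract",
                    "multiply", "divided", "sum"}


def words(text):
    """Split text into a set of lowercase words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def lint_question(question, answers):
    """Return a list of warnings for a hand-written question."""
    warnings = []
    q_words = words(question)

    # Guideline 1: the question must not contain its own answer,
    # otherwise a bot can succeed by trying every word of the question.
    leaked = sorted(a for a in answers if words(a) and words(a) <= q_words)
    if leaked:
        warnings.append("answer appears in question: " + ", ".join(leaked))

    # Guideline 2: no arithmetic, whether spelled out ("two plus four")
    # or written with digits and an operator ("2 + 4").
    if q_words & ARITHMETIC_WORDS or re.search(r"\d\s*[-+*/x]\s*\d", question):
        warnings.append("looks like arithmetic")

    return warnings


print(lint_question("Which is a fruit: hammer, apple or cloud?", ["apple"]))
print(lint_question("What is two plus four?", ["six", "6"]))
print(lint_question("What do bees make that people eat?", ["honey"]))
```

The first call flags the leaked answer, the second flags arithmetic, and the third returns an empty list. An empty list only means the question passed these two checks, not that it is a good question.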


No public databases of questions


The questions should be kept secret to each site so spammers cannot build databases from public questions. Although creating questions requires human work, we do not need hundreds of thousands of them. If each site creates its own questions, spammers have no pattern to follow. They can only crack the questions manually by employing humans on each site, which will be expensive and not worth their time (except in those cases where spammers manage to get people to enter responses without their knowledge). Very popular sites will of course need more questions, but they should also have the resources to write them. For most personal sites, it would simply not be worth a spammer's while to spend hours on the site.


Site-specific questions

Sites can create questions specific to their site. This makes it even more difficult to use responses collected elsewhere or to create linguistic bots. For example, fan sites can ask questions only the fans would know. A medical site for doctors can ask medical questions.


The local language advantage

Image CAPTCHAs are universal: when one type is cracked, it is vulnerable on any site in any language. With text-based questions, sites will write in their own local languages. Linguistic spambots would need to be created for each language, which would mean more time and work for spammers. For many languages it would not be worth the effort.


On the next page we'll look at some examples.