Breaking a CAPTCHA – rules for good design

Sunday, June 7th, 2009

A CAPTCHA is a challenge-response test used to verify that the end user is a human and not a computer. CAPTCHA is an acronym for Completely Automated Public Turing test to tell Computers and Humans Apart.

Captchas seem to have become increasingly popular as a method to prevent the submission of spam and automated responses. You can see a captcha generated by the popular reCAPTCHA service when you post a comment on this blog.

The main problem with captchas is that the people who implement them are sometimes lazy, or have no knowledge of how to create an image that a computer would find hard to decode. Captchas must be generated server side, yet over the last few months I have seen an increase in the number of client-side captchas generated by software such as Adobe Flex. If you generate a captcha client side it is not secure, because the answer has to be present on the very machine you are trying to test.

When designing a captcha it's important to understand what computers find hard to do.

It's hard for a computer to segment an image. Computers need to segment an image in order to classify each character of text. Anything you can do that blurs the boundaries between characters, such as running letters together, will make it much harder for a computer to read your captcha.
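To see why running letters together matters, here is a minimal sketch of the most naive segmentation approach: splitting on empty columns. The tiny text bitmaps and the function name `segment_columns` are made up for illustration; a real attack would work on pixel data and use something more robust.

```python
def segment_columns(bitmap):
    """Split a binary image into character spans using empty columns.

    bitmap is a list of equal-length strings; '#' is ink, '.' is background.
    Returns a list of (start, end) column ranges, end exclusive.
    """
    width = len(bitmap[0])
    # A column "has ink" if any row has a '#' in that column.
    has_ink = [any(row[x] == "#" for row in bitmap) for x in range(width)]

    spans = []
    start = None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                    # a new character begins
        elif not ink and start is not None:
            spans.append((start, x))     # an empty column ends the character
            start = None
    if start is not None:
        spans.append((start, width))     # character runs to the right edge
    return spans


# Three made-up glyphs separated by one empty column each.
spaced = [
    "###.#.#.###",
    "#.#.#.#.#..",
    "###.###.###",
]

# The same idea with the gaps removed, so the glyphs touch.
touching = [
    "####.####",
    "#.##.##..",
    "#########",
]

print(segment_columns(spaced))    # three separate spans
print(segment_columns(touching))  # one single span
```

With gaps between the glyphs the splitter finds three characters. Once they touch it sees one blob, and the attacker has to fall back on something much more complicated.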

Let's look at the following example to see how easy it is to segment the text from a badly designed captcha.

Image segmentation in 3 steps: (1) acquire the image, (2) apply thresholding, (3) apply a simple convolution matrix

Two very simple algorithms have been applied to the captcha above: first thresholding, then a convolution matrix to remove the vertical and horizontal lines. The captcha below shows that some captchas can be segmented using thresholding alone.

Even worse: the background can be removed with simple thresholding alone
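The two steps described above, thresholding followed by a convolution, can be sketched in a few lines of plain Python. This is only an illustration under my own assumptions: the image is a hand-built grid of grayscale values, the cutoff of 128 is arbitrary, and the 3x3 box kernel with a neighbour count of 4 is one simple way to drop one-pixel-wide lines, not necessarily the matrix used for the images in this post.

```python
def threshold(gray, cutoff=128):
    """Turn a grayscale image (rows of 0-255 ints) into 1 = ink, 0 = background."""
    return [[1 if px < cutoff else 0 for px in row] for row in gray]


def convolve3x3(img, kernel):
    """Convolve a 2D list with a 3x3 kernel; pixels outside the image count as 0."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0
            for ky in range(3):
                for kx in range(3):
                    # Flipped kernel indices make this a true convolution.
                    yy, xx = y + 1 - ky, x + 1 - kx
                    if 0 <= yy < h and 0 <= xx < w:
                        total += img[yy][xx] * kernel[ky][kx]
            out[y][x] = total
    return out


def remove_thin_lines(binary, min_neighbours=4):
    """Keep an ink pixel only if its 3x3 window holds enough ink.

    A pixel on an isolated one-pixel-wide straight line has at most 3 ink
    pixels in its window (itself included), so it falls below the default.
    """
    box = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
    counts = convolve3x3(binary, box)
    return [
        [1 if binary[y][x] and counts[y][x] >= min_neighbours else 0
         for x in range(len(binary[0]))]
        for y in range(len(binary))
    ]


# Made-up 7x7 test image: light background, a thin dark line, a solid block.
gray = [[200] * 7 for _ in range(7)]
gray[0] = [60] * 7                      # one-pixel-wide line across the top
for y in range(2, 5):
    for x in range(2, 5):
        gray[y][x] = 20                 # 3x3 block standing in for a character

binary = threshold(gray)
cleaned = remove_thin_lines(binary)
for row in cleaned:
    print("".join("#" if px else "." for px in row))
```

On this toy image the line across the top disappears and the 3x3 block survives intact. Real captchas are messier, for example where a line crosses a character, but the principle is the same.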

Once the image has been segmented, the computer then has to classify each character. The more options there are for each character, the higher the chance of the computer classifying it incorrectly, so when you're designing your captcha it pays to use the entire set of letters and numbers rather than restricting yourself to just numbers or just letters. It follows that the longer your captcha is, the more chance there is of the computer making a mistake in classification. Google makes its captchas between 8 and 11 characters in length.
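A quick bit of arithmetic shows how much alphabet size and length matter. The sketch below assumes, for simplicity, that each character is classified independently with the same accuracy; the 90% figure is a made-up number for illustration, not a measured rate.

```python
def keyspace(alphabet_size, length):
    """Number of distinct strings of the given length over the alphabet."""
    return alphabet_size ** length


def whole_string_accuracy(per_char_accuracy, length):
    """Chance that every character is classified correctly, assuming independence."""
    return per_char_accuracy ** length


print(keyspace(10, 4))                # digits only, 4 characters
print(keyspace(62, 8))                # upper case, lower case and digits, 8 characters
print(whole_string_accuracy(0.9, 4))  # hypothetical 90% per-character accuracy
print(whole_string_accuracy(0.9, 8))
```

Under those assumptions a classifier that gets 90% of individual characters right solves roughly 66% of four-character captchas but only about 43% of eight-character ones.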

Both of the captchas we have seen so far are easy to segment, and they are also easy to classify. This is because each character always looks the same within the image (both 4s look identical in the second captcha). To make it tricky for the computer to classify each character, we need to use a different font for each character, or warp each character by applying rotation or other image morphing operators. The following captcha from Yahoo! is much harder to segment because there is little space between the characters. It also uses both upper and lower case characters and has been morphed so that the string is harder to classify.

A set of captchas from Yahoo! that are hard to segment and classify

Sadly, as captchas become harder for computers to read they also become harder for humans to read. There is a fine line between providing the necessary security and frustrating a user with a captcha that is impossible to read.

Google has an interesting solution to this using Markov chains. Random strings that appear to be words are generated using a statistical method known as a Markov chain. These strings are much easier for a human to read because they look like normal words. However, they are not real words, and this is important: if dictionary words were used, an attacker could bring in a dictionary to improve classification rates.

Google captchas use Markov chains to make them easier to read
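I don't know how Google actually builds its strings, but the general idea of a character-level Markov chain is easy to sketch. The code below learns which letter tends to follow each pair of letters in a small word list, then walks those statistics to produce word-like strings and rejects anything that is a real word from the list. The word list, the function names and the parameters are all my own choices for illustration.

```python
import random
from collections import defaultdict

# Tiny made-up training list; a real generator would use a large word list.
WORDS = [
    "captcha", "computer", "segment", "character", "classify", "threshold",
    "picture", "security", "random", "matter", "letter", "human",
    "string", "markov", "reading", "content",
]


def build_model(words, order=2):
    """Record which character follows each `order`-letter context."""
    model = defaultdict(list)
    for word in words:
        padded = "^" * order + word + "$"   # ^ marks the start, $ the end
        for i in range(len(padded) - order):
            model[padded[i:i + order]].append(padded[i + order])
    return model


def generate(model, order=2, max_len=10, rng=random):
    """Walk the chain from the start marker until '$' or max_len letters."""
    context = "^" * order
    out = []
    while len(out) < max_len:
        nxt = rng.choice(model[context])
        if nxt == "$":
            break
        out.append(nxt)
        context = context[1:] + nxt
    return "".join(out)


def pseudo_word(model, real_words, order=2, min_len=4, max_len=10,
                rng=random, attempts=1000):
    """Return a generated string that is not a real word, or None if none is found."""
    for _ in range(attempts):
        candidate = generate(model, order, max_len, rng)
        if len(candidate) >= min_len and candidate not in real_words:
            return candidate
    return None


model = build_model(WORDS)
for _ in range(5):
    print(pseudo_word(model, set(WORDS)))
```

The output changes on every run, but the strings tend to look pronounceable because every three-letter sequence in them came from a real word. The `order` and `model` passed to `pseudo_word` must match the ones used in `build_model`.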

It's pretty easy to design and write a good captcha using PHP GD or something similar. If you can't be bothered to write a captcha then services such as reCAPTCHA exist which can provide you with an effective captcha solution (although this is vulnerable to the “penis flood attack”).

No captcha will ever be 100% secure. Rumor has it that even Google’s captcha has been broken with a classification rate of 20%, and there are even stories of captcha sweatshops emerging around the world where people are paid to solve captchas, a kind of Mechanical Turk.

As algorithms become more sophisticated an alternative to captchas will need to be found, but until one has been found you may as well make sure that your captchas are secure.