Archive for the ‘Security’ Category

Detecting a fake email address using Markov chains

Saturday, August 22nd, 2009

A Markov chain is a sequence of states in which each state depends only on the previous state. Markov chains can be used to generate “real-looking” words from a given set of text. By the same method we can decide whether a string is a valid word or a load of garbage by assessing each letter and the letter that follows it in the word. If the probability of letter N+1 coming after letter N is very small then we can probably say that the chance of the string being a word is very small.

When users sign up with a fake email address they tend not to put much thought into the name of the email. Something like sdfjsldkf87we@example.com is a good example. To filter these email addresses out we can take a dictionary and calculate the probability of the next letter (N+1) given the previous letter (N) and compare this to what we observe in the fake email address. If the probability of the next letter is repeatedly low then we can say that the email address is probably fake.

My algorithm scores each email, giving it a point each time a letter N+1 should never come after letter N and reducing the score by 1 for every 12 characters in the email address. This length adjustment helps to reduce the number of false positives. I only check the initial part of the address – that is, the part before the @, excluding the @example.com.

You’ll probably wonder how the code deals with non-alphanumeric characters. I just strip them out and convert the whole email to lower case. There is probably a better method for doing this but my existing system seems to work quite well. The table below shows my algorithm running on a few sample email addresses. I consider an email with a score of 3 or more to be dodgy.

E-mail                         Score
phil.hilton@markov-email.com   0
bill.gates@microsoft.com       0
sdfioghsjfkg@gmail.com         3
tracy93@wow-markov.net         0
pzrjmt@yahoo.com               4
gquixdmd@yahoo.com             3
svcmgr1461@yahoo.com           3
hjjjh_hjjh@yahoo.com           7
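
The scoring described above can be sketched roughly as follows. This is not the code from markov.php: the function names and the tiny SAMPLE_WORDS list are made up for illustration, a real run would need a full dictionary, and it will not reproduce the scores in the table. Two details are assumptions on my part: pairs involving a digit are skipped rather than penalised, and the 12-character discount is applied to the cleaned part before the @.

```python
import re

# Hypothetical stand-in for a real dictionary; far too small for real use.
SAMPLE_WORDS = ["phil", "hilton", "silhouette", "bill", "gates", "tracy"]

def normalise(text):
    """Lower-case the text and strip everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

def build_transitions(words):
    """Collect every (letter N, letter N+1) pair seen in the word list."""
    seen = set()
    for word in words:
        cleaned = normalise(word)
        seen.update(zip(cleaned, cleaned[1:]))
    return seen

def score_email(address, transitions, chars_per_discount=12):
    """One point per letter pair never seen in the dictionary,
    minus one point for every 12 characters of the cleaned local part."""
    local_part = normalise(address.split("@", 1)[0])
    score = 0
    for a, b in zip(local_part, local_part[1:]):
        if a.isdigit() or b.isdigit():
            continue  # assumption: the dictionary has no digits, so skip them
        if (a, b) not in transitions:
            score += 1
    return score - len(local_part) // chars_per_discount

transitions = build_transitions(SAMPLE_WORDS)
for address in ("phil.hilton@markov-email.com",
                "tracy93@wow-markov.net",
                "pzrjmt@yahoo.com"):
    print(address, score_email(address, transitions))
```

With a word list this small, plenty of real names would be flagged; the point is only to show the shape of the calculation.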

This method isn’t fail-proof but it is pretty good at detecting bad email addresses, and you could use it along with additional checks on the user’s account to detect fraudulent activity. There will be some false positives, mainly with people whose email addresses rely heavily on their initials, and I’m sure it’s only a matter of time before the people committing the fraud start using Markov-compliant email addresses.

Download my code. There are 2 main files: markov.php, which contains example code, and markovChain.dat, which contains a pre-calculated Markov chain.

Defeating open proxy servers

Thursday, August 20th, 2009

I’ve recently been in a situation where lots of users were abusing a website using a series of open proxies. They were using these open proxies to commit large volumes of fraud. A static list of known proxies can help to combat this issue but you end up fighting a losing battle trying to keep the list up to date.

I’m fighting back – new users of the service who want to buy items get their computer port scanned as part of the payment process. I only check the ports that proxies are known to run on: 8080, 3128, 1080, 3124 and 3127. If any of these ports are open the server adds a note to their payment and a human reviews the purchase before the payment is taken.
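
A minimal sketch of that kind of check, assuming a plain TCP connect test is enough to count a port as “open”. The function names and the timeout are my own choices for illustration, not the code running on the site, and it should only be pointed at hosts you are entitled to scan.

```python
import socket

# Ports named in the post as commonly used by open proxies.
PROXY_PORTS = (8080, 3128, 1080, 3124, 3127)

def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, timed out, etc.
        return False

def open_proxy_ports(host, ports=PROXY_PORTS, timeout=1.0):
    """Return the subset of ports that accept a connection."""
    return [port for port in ports if is_port_open(host, port, timeout)]

def needs_manual_review(host):
    """Flag the payment for a human if any known proxy port is open."""
    return bool(open_proxy_ports(host))
```

An open port only shows that something is listening there, not that it is a proxy, which is why the result feeds a human review rather than an automatic rejection.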

It’s not been running long and I’m not exactly sure if it’s legal (the T.O.S. have had to be updated) – either way it’ll be interesting to see how effective it is in combating abuse from open proxy servers. I think it could, and probably will, end up as an arms race between me and the fraudsters. I’ll keep people posted and let you know if it works out.

Mitigating the insider threat

Thursday, June 25th, 2009

If you look at the hacking incidents that have been reported, 58% are known or suspected to have come from outsiders, 27% from insiders, and 15% from an unknown origin.

That is to say, the very employees of an organisation are responsible for nearly 30% of all hacks. Disgruntled employees, in particular system administrators, are in a prime position to sabotage their former businesses, and with the onset of the recession the number who might be tempted to take data with them (or even worse, cripple the system) when they leave is ever increasing.

The threat from insiders is far more dangerous than that of an external hacker – insiders know how the system works and are in an excellent position to cause chaos and then expertly cover their tracks.

What can we do? Well, if you do have to make someone redundant or need to fire them, make sure they don’t see it coming so they have no time to prepare and no time to retaliate. While they are in the boss’s office hearing the news you need to be disabling their user account and all of their access to the system. If you don’t do this then you risk a major security breach.

In an ideal world each user should only have access to the data that they need in order to do their job. Other methods such as two-person control should also be in place for important tasks such as removing money or making external payments. System administrators should review each other’s logs on a regular basis to ensure nothing untoward is occurring.

The insider threat is very real and we cannot afford to dismiss it.

Problems with DKIM keys and Postfix

Friday, June 19th, 2009

If you don’t know, DKIM (DomainKeys Identified Mail) is the replacement for Yahoo!’s DomainKeys, which was introduced to combat spam. It’s basically a digital signature in the header of the email message that lets the receiving mail server verify which domain the message came from.

I’ve been trying to get dkimproxy.out to work with Postfix – which I’ve managed to do. The only issue is that it doesn’t seem to be signing the messages correctly – not quite sure what’s wrong.

Delivered-To: xxx.xxxxx@gmail.com
Received: by 10.103.243.5 with SMTP id v5cs118747mur;
Fri, 19 Jun 2009 11:18:43 -0700 (PDT)
Received: by 10.210.30.10 with SMTP id d10mr1099509ebd.14.1245435522990;
Fri, 19 Jun 2009 11:18:42 -0700 (PDT)
Return-Path:
Received: from idpd.vm.bytemark.co.uk ([80.68.93.52])
by mx.google.com with ESMTP id 6si6760399ewy.54.2009.06.19.11.18.42;
Fri, 19 Jun 2009 11:18:42 -0700 (PDT)
Received-SPF: pass (google.com: domain of test@idontplaydarts.com designates 80.68.93.52 as permitted sender) client-ip=80.68.93.52;
Authentication-Results: mx.google.com; spf=pass (google.com: domain of test@idontplaydarts.com designates 80.68.93.52 as permitted sender) smtp.mail=test@idontplaydarts.com; dkim=neutral (bad format) header.i=test@idontplaydarts.com
Received: from localhost.localdomain (localhost.localdomain [127.0.0.1])
by idpd.vm.bytemark.co.uk (Postfix) with SMTP id 82728722DD
for
; Fri, 19 Jun 2009 19:19:01 +0100 (BST)
DKIM-Signature: v=1; a=rsa-sha1; c=relaxed; d=idontplaydarts.com; h=
subject; s=selector1; bh=uoq1oCgLlTqpdDX/iUbLy7J1Wic=; b=X6q/deT
OiqL1ea8qZiP3qsIKDmoWTdlt4Zgd36FfY3kAhLv1JZf1q6h93REQLqLl
subject: Hello world
Message-Id: <20090619181901.82728722DD@idpd.vm.bytemark.co.uk>
Date: Fri, 19 Jun 2009 19:19:01 +0100 (BST)
From: test@idontplaydarts.com
To: undisclosed-recipients:;

Hey there test!!

I’ve checked the DKIM entry in the TXT records – it seems to be accurate and the signing appears to be working (according to the mail.log output). Anyone got any ideas why I’m getting this “bad format” in the header? I’m guessing it’s something to do with the message header being incorrect.

So far I have tried:

  • Reducing the size of the key – down to 384 bits from 1024 bits
  • Changing the selector name

Any ideas anyone?

Update: So it appears that my crude method of sending emails using

telnet localhost 25
MAIL FROM:test@idontplaydarts.com
RCPT TO:xxx.xxxx@gmail.com
DATA
Subject: woot
hello world
.

is missing the To and From headers after the DATA command – it turns out you need to specify them. *doh* It’s all working fine now; I’m just going to increase the size of the keys back to 1024 bits.
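
For comparison, here is a small sketch of a test message that does carry the From and To headers, built with Python’s standard email library instead of typed into telnet. The recipient address is a placeholder and build_message is just a name I picked. DKIM requires the From header to be among the signed headers, and the signature above only lists subject in h=, which fits the “bad format” result.

```python
from email.message import EmailMessage

def build_message(sender, recipient, subject, body):
    """Build a message with the From, To and Subject headers set explicitly."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)
    return msg

msg = build_message("test@idontplaydarts.com", "recipient@example.com",
                    "woot", "hello world")

# Headers come first, then a blank line, then the body.
print(msg.as_string())

# To hand it to a local Postfix instance (assumed to listen on port 25):
#   import smtplib
#   with smtplib.SMTP("localhost", 25) as smtp:
#       smtp.send_message(msg)
```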

Breaking a CAPTCHA – rules for good design

Sunday, June 7th, 2009

A CAPTCHA is a challenge and response test used to verify that the end-user is a human and not a computer – CAPTCHA is an acronym standing for a Completely Automated Public Turing test to tell Computers and Humans Apart.

Captchas seem to have become increasingly popular as a method to prevent the submission of spam and automated responses. You can see a captcha generated by the popular reCAPTCHA service when you post a comment on this blog.

The main problem with the Captcha is that sometimes the people who implement them are lazy or have no knowledge of how to create an image that a computer would find hard to decode. Captchas must be generated server side, and over the last few months I have seen an increase in the number of client-side captchas generated by software such as Adobe Flex. If you generate a Captcha client side it is not secure, because the answer has to be present in code that the attacker controls.

When designing a Captcha it’s important to understand what computers find hard to do.

It’s hard for a computer to segment an image. Computers need to segment an image in order to classify each character of text. Anything you can do that makes the image harder to segment, such as running letters together, will make it much harder for a computer to read your captcha.
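
To see why running letters together helps, here is a rough sketch of the simplest segmentation approach I can think of: split a black-and-white image wherever there is a completely empty column. The function name and the toy images are made up for illustration; real attacks use smarter methods, but the idea is the same.

```python
def segment_columns(binary):
    """Split a binary image (rows of 0/1) into (start, end) column ranges,
    cutting wherever a column contains no ink at all."""
    width = len(binary[0])
    has_ink = [any(row[x] for row in binary) for x in range(width)]
    segments, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                    # a character begins
        elif not ink and start is not None:
            segments.append((start, x))  # a character ends
            start = None
    if start is not None:
        segments.append((start, width))
    return segments

# Two "characters" with a gap between them: easy to split.
spaced = [[1, 1, 0, 1, 1],
          [1, 0, 0, 0, 1]]

# The same shapes joined by a single pixel: no empty column, no split.
touching = [[1, 1, 1, 1, 1],
            [1, 0, 0, 0, 1]]

print(segment_columns(spaced))
print(segment_columns(touching))
```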

Let’s look at the following example to see how easy it is to segment the text from a badly designed captcha.

Image segmentation in 3 steps, (1) Acquire the image, (2) Apply thresholding (3) Apply a simple Convolution Matrix

Two very simple algorithms have been applied to the Captcha above: first thresholding, and then a convolution matrix to remove the vertical and horizontal lines. If we look at the captcha below we can see that some captchas can be segmented using thresholding alone.

Even worse - The background can easily be removed by just simple thresholding
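
The two steps mentioned above, thresholding followed by a convolution pass, can be sketched in plain Python like this. The 3x3 box kernel and the min_neighbours cut-off are my own guesses at a filter that strips one-pixel-wide lines; I am not claiming it is the exact matrix used to produce the images.

```python
def threshold(image, cutoff=128):
    """Binarise a greyscale image: 1 where a pixel is darker than cutoff, else 0."""
    return [[1 if px < cutoff else 0 for px in row] for row in image]

def convolve3x3(image, kernel):
    """Apply a 3x3 kernel to a 2D list, treating out-of-bounds pixels as 0."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0
            for ky in range(3):
                for kx in range(3):
                    yy, xx = y + ky - 1, x + kx - 1
                    if 0 <= yy < height and 0 <= xx < width:
                        total += image[yy][xx] * kernel[ky][kx]
            out[y][x] = total
    return out

BOX_KERNEL = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]  # counts ink in the 3x3 neighbourhood

def remove_thin_lines(binary, min_neighbours=4):
    """Keep an ink pixel only if enough of its 3x3 neighbourhood is also ink.
    A one-pixel-wide line has at most 3, so it disappears."""
    counts = convolve3x3(binary, BOX_KERNEL)
    return [[1 if px and counts[y][x] >= min_neighbours else 0
             for x, px in enumerate(row)]
            for y, row in enumerate(binary)]

# Toy 7x7 image: white background, a grey horizontal line, a black 3x3 blob.
image = [[255] * 7 for _ in range(7)]
image[1] = [100] * 7
for y in range(3, 6):
    for x in range(2, 5):
        image[y][x] = 0

binary = threshold(image)
cleaned = remove_thin_lines(binary)
for row in cleaned:
    print("".join("#" if px else "." for px in row))
```

In this toy image the line never touches the blob; where a line crosses a character the same filter would need more care.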

Once the image has been segmented the computer then has to classify each character. The more options there are for each character, the higher the chance of the computer classifying the letter incorrectly, so when you’re designing your captcha it pays to use the full range of letters and digits and not restrict yourself to just numbers or just letters. It follows that the longer your captcha, the more chance there is of the computer making a mistake in classification. Google makes its captchas between 8 and 11 characters in length.
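
A back-of-the-envelope illustration of both points, assuming each character is classified (or guessed) independently of the others. The function names and the example numbers are mine, not measured results.

```python
def solve_rate(per_char_accuracy, length):
    """Chance of getting every character right if each one is classified
    independently with the given accuracy."""
    return per_char_accuracy ** length

def blind_guess_rate(alphabet_size, length):
    """Chance of a purely random guess being correct."""
    return (1 / alphabet_size) ** length

# A classifier that is right 90% of the time per character:
for length in (4, 8, 11):
    print(length, "chars:", round(solve_rate(0.9, length), 3))

# Digits only versus upper case, lower case and digits, for a 4-character captcha:
print(blind_guess_rate(10, 4))
print(blind_guess_rate(62, 4))
```

Even a fairly good per-character classifier fails most of the time once the string gets long, which is the argument for longer captchas.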

Both of the captchas we have seen so far are easy to segment – they are also easy to classify. This is due to the similarity of the characters within the image (both 4s look the same in the second captcha). If we want to make it tricky for the computer to classify each character we need to use different fonts for each character or warp each character by applying rotation or other image morphing operators. The following Captcha from Yahoo! is much harder to segment because there is little space between the characters. It also uses both upper and lower case characters and has been morphed so that the string is harder to classify.

A set of captchas from Yahoo! that are hard to segment and classify

Sadly as captchas become harder for computers to read they also become harder for humans to read – there is a fine line between providing the necessary security and frustrating a user with a captcha that is impossible to read.

Google has an interesting solution to this using Markov chains: random strings that appear to be words are generated using a statistical method known as a Markov chain. These strings are much easier for a human to read because they look like normal words; however, they are not words, and this is important. If dictionary words were used then a dictionary could be introduced to improve captcha classification rates.
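
A minimal sketch of generating word-like strings with a first-order Markov chain over letters. I have no idea what Google actually runs; the training words, the function names and the “^” and “$” start/end markers are all assumptions made for the example.

```python
import random
from collections import defaultdict

def build_chain(words):
    """Map each letter to the list of letters observed after it.
    '^' marks the start of a word and '$' marks the end."""
    chain = defaultdict(list)
    for word in words:
        padded = "^" + word.lower() + "$"
        for current, following in zip(padded, padded[1:]):
            chain[current].append(following)
    return chain

def generate_word(chain, rng, max_len=10):
    """Walk the chain from the start marker until the end marker or max_len."""
    letters, current = [], "^"
    while len(letters) < max_len:
        current = rng.choice(chain[current])
        if current == "$":
            break
        letters.append(current)
    return "".join(letters)

TRAINING_WORDS = ["captcha", "markov", "chain", "secure", "random", "letter"]
chain = build_chain(TRAINING_WORDS)
rng = random.Random()  # pass a seed here for repeatable output
for _ in range(5):
    print(generate_word(chain, rng))
```

A real generator would also throw away any output that happens to be a dictionary word, for the reason given above.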

Google captchas use Markov chains to make them easier to read

It’s pretty easy to design and write a good captcha using PHP GD or something similar. If you can’t be bothered to write a captcha then services such as reCAPTCHA exist which can provide you with an effective captcha solution (although this is vulnerable to the “penis flood attack”).

No captcha will ever be 100% secure. Rumor has it that even Google’s captcha has been broken with a classification rate of 20%, and there are even stories of captcha sweatshops emerging around the world where people are paid to solve Captchas – a kind of Mechanical Turk.

As algorithms become more sophisticated, alternatives to captchas need to be found, but until they are you may as well make sure that your captchas are secure.