Detecting a fake email address using Markov chains

Saturday, August 22nd, 2009

A Markov chain is a sequence of states in which each state depends only on the previous state. Markov chains can be used to generate “real-looking” words from a given body of text. By the same method we can decide whether a string is a plausible word or a load of garbage by looking at each letter and the letter that follows it. If the probability of letter N+1 coming after letter N is very small, the chance of the string being a real word is also small.

When users sign up with a fake email address they tend not to put much thought into the name part of the address. Something like sdfjsldkf87we@example.com is a good example. To filter these addresses out we can take a dictionary, calculate the probability of each next letter (N+1) given the previous letter (N), and compare this with what we observe in the suspect email address. If the probability of the next letter is repeatedly low, the email address is probably fake.
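To make the dictionary step concrete, here is a minimal Python sketch of how letter-to-letter probabilities could be estimated from a word list. It is not the PHP code from the download; the function name `build_transitions` and the tiny `sample_words` list are made up for illustration, and a real run would use a full dictionary file.

```python
from collections import defaultdict


def build_transitions(words):
    """Estimate P(next character | current character) from a list of words."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in words:
        # Assumed preprocessing: lower-case, keep only ASCII letters and digits.
        cleaned = "".join(ch for ch in word.lower() if ch.isascii() and ch.isalnum())
        # Count every adjacent pair (character N, character N+1).
        for current, following in zip(cleaned, cleaned[1:]):
            counts[current][following] += 1

    # Turn the raw counts into probabilities, one row per current character.
    probabilities = {}
    for current, followers in counts.items():
        total = sum(followers.values())
        probabilities[current] = {ch: n / total for ch, n in followers.items()}
    return probabilities


# Hypothetical stand-in for a dictionary; far too small for real use.
sample_words = ["hello", "help", "hilton", "phil", "bill", "gates"]
transitions = build_transitions(sample_words)

# Any pair missing from the table was never seen in the word list.
print(transitions["h"])
print("z" in transitions["h"])
```

With this toy list, "h" is followed only by "e" or "i", so a pair such as "hz" has no entry at all. Pairs like that are the ones a scoring step would treat as suspicious.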

My algorithm scores each email, giving it a point each time a letter N+1 appears after a letter N that it should never follow, and reducing the score by 1 for every 12 characters in the email address. This length adjustment helps to reduce the number of false positives. I only check the local part of the address, that is, the part before the @ (everything except @example.com).
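The scoring rule can be sketched in Python along these lines. This is an illustration under assumptions rather than the code from the download: the names `score_email`, `normalise` and `allowed_pairs` are hypothetical, the set of allowed pairs is built from a toy word list instead of a real Markov chain, the length discount is applied to the cleaned local part, and the score is not clamped at zero.

```python
def local_part(email):
    """Return everything before the last '@' in the address."""
    return email.rsplit("@", 1)[0]


def normalise(text):
    """Lower-case the text and drop anything that is not an ASCII letter or digit."""
    return "".join(ch for ch in text.lower() if ch.isascii() and ch.isalnum())


def score_email(email, allowed_pairs, chars_per_discount=12):
    """Score an address: +1 per never-seen pair, -1 per 12 characters."""
    cleaned = normalise(local_part(email))
    # One point for each adjacent pair that never occurs in the reference text.
    unseen = sum(1 for pair in zip(cleaned, cleaned[1:]) if pair not in allowed_pairs)
    # Longer names get a discount (assumed here to use the cleaned local part).
    discount = len(cleaned) // chars_per_discount
    return unseen - discount


# Hypothetical reference data; a real chain would come from a full dictionary.
reference_words = ["phil", "hilton", "bill", "gates", "tracy"]
allowed_pairs = {pair for word in reference_words for pair in zip(word, word[1:])}

print(score_email("gates@example.com", allowed_pairs))  # 0
print(score_email("xqzjv@example.com", allowed_pairs))  # 4
```

Because the reference data here is only five words, these scores are not comparable with the ones my real chain produces; the sketch only shows the shape of the calculation.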

You’ll probably wonder how the code deals with non-alphanumeric characters. I just strip them out and convert the whole email to lower-case. There is probably a better method for doing this but my existing system seems to work quite well. The table below shows my algorithm running on a few sample email addresses. I consider an email with a score of 3 or more to be dodgy.

E-mail                        Score
phil.hilton@markov-email.com  0
bill.gates@microsoft.com      0
sdfioghsjfkg@gmail.com        3
tracy93@wow-markov.net        0
pzrjmt@yahoo.com              4
gquixdmd@yahoo.com            3
svcmgr1461@yahoo.com          3
hjjjh_hjjh@yahoo.com          7

This method isn’t fail-proof but it is pretty good at detecting bad email addresses, and you could use it along with additional checks on the user’s account to detect fraudulent activity. There will be some false positives, mainly from people whose email addresses rely heavily on their initials, and I’m sure it’s only a matter of time before the people committing the fraud start using Markov-compliant email addresses.

Download my code. There are two main files: markov.php, which contains example code, and markovChain.dat, which contains a pre-calculated Markov chain.