Archive for August, 2009

Detecting a fake email address using Markov chains

Saturday, August 22nd, 2009

A Markov chain is a sequence of states in which each state depends only on the previous state. Markov chains can be used to generate “real-looking” words from a given body of text. By the same method we can decide whether a string is a valid word or a load of garbage by assessing each letter and the letter that follows it in the word. If the probability of letter N+1 coming after letter N is very small, then the chance of the string being a word is probably very small too.

When users sign up with a fake email address they tend not to put much thought into the name of the email. Something like sdfjsldkf87we@example.com is a good example. To filter these email addresses out we can take a dictionary and calculate the probability of the next letter (N+1) given the previous letter (N) and compare this to what we observe in the fake email address. If the probability of the next letter is repeatedly low then we can say that the email address is probably fake.

My algorithm scores each email, adding a point each time a letter N+1 follows a letter N that it should never come after, and reducing the score by 1 for every 12 characters in the email address. This length adjustment helps to reduce the number of false positives, since longer addresses contain more letter pairs and so have more chances to hit an unlikely one. I only check the local part of the address, that is, the part before the @example.com.

You’ll probably wonder how the code deals with non-alphanumeric characters. I just strip them out and convert the whole email to lower-case. There is probably a better method for doing this but my existing system seems to work quite well. The table below shows my algorithm running on a few sample email addresses. I consider an email with a score of 3 or more to be dodgy.

E-mail                          Score
phil.hilton@markov-email.com    0
bill.gates@microsoft.com        0
sdfioghsjfkg@gmail.com          3
tracy93@wow-markov.net          0
pzrjmt@yahoo.com                4
gquixdmd@yahoo.com              3
svcmgr1461@yahoo.com            3
hjjjh_hjjh@yahoo.com            7
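
The scoring described above can be sketched roughly as follows. This is not the code from markov.php: the function names, the idea of building the pair table from a plain word list, skipping pairs that contain a digit (a dictionary has none), and applying the length adjustment to the cleaned-up local part are all assumptions made for illustration.

```php
<?php
// Illustrative sketch only. Function names and behaviour details are
// assumptions, not the contents of markov.php or markovChain.dat.

/** Lower-case the input and strip everything except a-z and 0-9. */
function normalise(string $text): string
{
    return preg_replace('/[^a-z0-9]/', '', strtolower($text));
}

/** Record every letter pair (N, N+1) that occurs in the word list. */
function buildSeenPairs(array $words): array
{
    $seen = [];
    foreach ($words as $word) {
        $word = normalise($word);
        for ($i = 0; $i + 1 < strlen($word); $i++) {
            $seen[$word[$i] . $word[$i + 1]] = true;
        }
    }
    return $seen;
}

/** One point per never-seen letter pair, minus one point per 12 characters. */
function scoreEmail(string $email, array $seen): int
{
    // Only the part before the @ is checked.
    $local = normalise(explode('@', $email)[0]);
    $score = 0;
    for ($i = 0; $i + 1 < strlen($local); $i++) {
        $pair = $local[$i] . $local[$i + 1];
        // Assumption: pairs containing a digit are ignored, not penalised.
        if (ctype_alpha($pair) && !isset($seen[$pair])) {
            $score++;
        }
    }
    return $score - intdiv(strlen($local), 12);
}
```

With a pair table built from a real dictionary, an address would be treated as dodgy when scoreEmail() returns 3 or more. The scores from this sketch will not necessarily match the table above, since the real pair table and the handling of digits may differ.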

This method isn’t foolproof but it is pretty good at detecting bad email addresses, and you could use it along with additional checks on the user’s account to detect fraudulent activity. There will be some false positives, mainly with people whose email addresses rely heavily on their initials, and I’m sure it’s only a matter of time before the people committing the fraud start using Markov-compliant email addresses.

Download my code. There are 2 main files: markov.php, which contains example code, and markovChain.dat, which contains a pre-calculated Markov chain.

Defeating open proxy servers

Thursday, August 20th, 2009

I’ve recently been in a situation where lots of users were abusing a website using a series of open proxies. They were using these open proxies to commit large volumes of fraud. A static list of known proxies can help to combat this issue but you end up fighting a losing battle trying to keep the list up to date.

I’m fighting back. New users of the service who want to buy items get their computer port scanned as part of the payment process. I only check the ports that proxies are known to run on: 8080, 3128, 1080, 3124 and 3127. If any of these ports are open the server adds a note to their payment and a human reviews the purchase before the payment is taken.
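
A check like this could be sketched with PHP’s fsockopen(). This is a minimal illustration rather than the code running on the site: findOpenPorts(), the 1-second timeout and the commented-out checkout usage are all assumptions.

```php
<?php
// Illustrative sketch: findOpenPorts() and the timeout value are assumptions.

/** Try a TCP connection to each port and return the ports that accepted it. */
function findOpenPorts(string $host, array $ports, float $timeout = 1.0): array
{
    $open = [];
    foreach ($ports as $port) {
        // fsockopen() returns false when the connection is refused or times out;
        // the @ suppresses the warning PHP raises in that case.
        $socket = @fsockopen($host, $port, $errno, $errstr, $timeout);
        if ($socket !== false) {
            fclose($socket);
            $open[] = $port;
        }
    }
    return $open;
}

// Ports listed in the post as commonly used by proxies.
$proxyPorts = [8080, 3128, 1080, 3124, 3127];

// Hypothetical use during checkout: flag the payment for human review.
// $needsReview = findOpenPorts($_SERVER['REMOTE_ADDR'], $proxyPorts) !== [];
```

The ports are tried one after another, so a host that silently drops packets costs up to one timeout per port. A short timeout keeps the delay during payment small, at the cost of missing slow hosts.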

It’s not been running long and I’m not exactly sure if it’s legal (the T.O.S. have had to be updated). Either way it’ll be interesting to see how effective it is in combating abuse from open proxy servers. I think it could, and probably will, end up as an arms race between me and the fraudsters. I’ll keep people posted and let you know if it works out.

Caching wordpress as static HTML using APC

Monday, August 17th, 2009

While trying to speed up my wordpress installation I noticed that there seemed to be a lot of plugins that generate a static home page – they all used complex methods to store the files, checking the last modified times and implementing complex checking methods to see if the content has changed.

It struck me as strange that no one had suggested using the APC cache. I don’t know if this is because it’s not available on all commercial hosting packages (such as dreamhost) or if it’s just due to lack of knowledge about the apc_store and apc_fetch functions. APC doesn’t just cache PHP opcode, it can cache user-defined PHP variables as well.

I’ve come up with a small script that is capable of caching my PHP pages generated by wordpress. It uses the APC cache to store and manage the page data and it guarantees that the page will never be more than a specified number of seconds out of date.

To use this script you only have to make 2 changes to index.php in the wordpress root. Firstly, place this code at the top of your index.php:


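<?php
// Only use the cache when no POST data was sent.
if (empty($_POST)) {

    // Key the cache on the requested URI plus the GET parameters.
    $cache_key = md5($_SERVER['REQUEST_URI'] . serialize($_GET));

    // $cache_result is set to true when the key exists and has not expired.
    $cache_data = apc_fetch($cache_key, $cache_result);

    if ($cache_result === true) {
        echo $cache_data;
        die;
    }

    // Cache miss: buffer the output so it can be stored at the end of index.php.
    ob_start();
}
?>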

And then place this code at the base of index.php:


<?php
if (empty($_POST)) {
    // Take the buffered page and keep it in APC for 300 seconds.
    $cache_data = ob_get_clean();
    apc_store($cache_key, $cache_data, 300);
    echo $cache_data;
}
?>

The code works by creating an MD5 hash of the URL and the GET request. This is important as it means that pages that take variables as arguments, such as search, will create their own cached pages. We don’t use the cache if there are POST variables present as we can probably assume that the output of the page is very dynamic. If you don’t want to cache pages that use a GET request either, you could use something like this:

if (empty($_POST) && empty($_GET)) {

It would be unwise to cache separate pages for each set of POST data as POST data is generally much larger than data sent via GET.

We use the MD5 of the URI and GET parameters as the key for our storage. If we find the key in memory and it hasn’t expired, we can use the data stored under that key to display the page. This bypasses all the MySQL queries and other PHP that were needed to create the original page.
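
The fetch, miss and store flow described above can be tried out without the APC extension. The sketch below swaps apc_fetch() and apc_store() for a small in-memory class, so ArrayCache, cachedPage() and the explicit $now parameter are all hypothetical names introduced for illustration. Only the key calculation and the 300 second lifetime come from the snippets above.

```php
<?php
// Illustrative sketch: an in-memory stand-in for apc_fetch()/apc_store().
// All class and function names here are hypothetical.

final class ArrayCache
{
    /** @var array key => [data, expiry timestamp] */
    private array $items = [];

    public function store(string $key, string $data, int $ttl, int $now): void
    {
        $this->items[$key] = [$data, $now + $ttl];
    }

    /** Returns the cached data, or null when the key is missing or expired. */
    public function fetch(string $key, int $now): ?string
    {
        if (!isset($this->items[$key]) || $this->items[$key][1] <= $now) {
            return null;
        }
        return $this->items[$key][0];
    }
}

/** Serve from the cache when possible, otherwise build the page and store it. */
function cachedPage(ArrayCache $cache, string $uri, array $get, callable $render, int $now): string
{
    // Same key as in index.php: MD5 of the URI plus the serialized GET data.
    $key = md5($uri . serialize($get));
    $page = $cache->fetch($key, $now);
    if ($page === null) {
        $page = $render();                      // the expensive WordPress work
        $cache->store($key, $page, 300, $now);  // keep it for 300 seconds
    }
    return $page;
}
```

Passing the current time in as $now, instead of calling time() inside the class, is only there to make the expiry behaviour easy to exercise. APC tracks the lifetime itself.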

Wordpress Static vs Dynamic benchmarks

If we now look at some benchmarks from before and after, we can see the difference. I’ve used ab (which comes with Apache) to perform the tests.

Without the caching

Requests per second: 8.52 [#/sec] (mean)
Time per request: 1173.343 [ms] (mean)
Time per request: 117.334 [ms] (mean, across all concurrent requests)
Transfer rate: 212.85 [Kbytes/sec] received
Connection Times (ms)
              min   mean  [+/-sd]  median   max
Connect:        0      0     0.1        0      1
Processing:   579   1161   104.4     1158   1464
Waiting:      234    465    82.6      449    756
Total:        579   1161   104.4     1158   1464
Percentage of the requests served within a certain time (ms)
50% 1158
66% 1186
75% 1225
80% 1238
90% 1289
95% 1315
98% 1407
99% 1413
100% 1464 (longest request)

With caching (tests done on non primed cache)

Requests per second: 136.07 [#/sec] (mean)
Time per request: 73.490 [ms] (mean)
Time per request: 7.349 [ms] (mean, across all concurrent requests)
Transfer rate: 3391.63 [Kbytes/sec] received
Connection Times (ms)
              min   mean  [+/-sd]  median   max
Connect:        0      0     0.6        0      4
Processing:    10     72    32.2       77    202
Waiting:        5     32    25.0       26    196
Total:         11     72    32.2       77    202
Percentage of the requests served within a certain time (ms)
50% 77
66% 89
75% 93
80% 98
90% 109
95% 120
98% 136
99% 142
100% 202 (longest request)

You can see that without caching the server can serve about 8.5 requests per second, and with caching it can serve 136 requests per second. That’s roughly a 16 times increase in speed (136.07 / 8.52). Not bad for a server with only 256 MB of RAM and a single-core CPU. I’m pretty sure the server could handle even more load, but at this point the upstream connection became saturated.

There are, however, some pitfalls with caching. Currently we cache the page even when a user is logged in, so we need to add a condition to prevent this. It’s not a problem on my server, however, as I am the only person allowed to blog. Another pitfall is the freshness of the data: we can set the lifetime to whatever we want (from 1 second to infinity), but a cached page is never going to be as “fresh” as one that isn’t cached.
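
One possible condition for the logged-in case is to look for WordPress’s login cookie, whose name starts with wordpress_logged_in_ followed by a hash. The helper below is a sketch under that assumption. isLoggedInRequest() is a hypothetical name, and it only checks that such a cookie is present, not that it is valid.

```php
<?php
// Illustrative sketch: isLoggedInRequest() is a hypothetical helper. It only
// looks at cookie names and does not validate the cookie's contents.

function isLoggedInRequest(array $cookies): bool
{
    foreach (array_keys($cookies) as $name) {
        // WordPress names its login cookie "wordpress_logged_in_<hash>".
        if (strpos((string) $name, 'wordpress_logged_in_') === 0) {
            return true;
        }
    }
    return false;
}

// Possible guard for both snippets in index.php:
// if (empty($_POST) && !isLoggedInRequest($_COOKIE)) { ... }
```

Because the check only decides whether to bypass the cache, a forged cookie just means the visitor gets an uncached page, which is no worse than running without caching.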

Caching isn’t for everyone but if you do want your site to run faster, can afford to have some slight inconsistencies in your data and don’t mind waiting a few minutes for a comment to appear then you can achieve some really good results.

Apache 2.2 proxy and LightTPD

Monday, August 17th, 2009

The server has just undergone some modifications. Previously I was using Lighttpd to serve all of the content; I liked its traffic shaping features and low memory footprint. But I also sorely missed the mod_rewrite functionality and mod_php provided by Apache.

The solution was simple: Apache 2.2.13 serves the dynamic PHP files and Lighttpd serves the static files via Apache’s proxy module. This gives me all the features of Apache, but only when I need them, while Lighttpd serves the static content.

The basic setup is simple. Lighttpd runs on 127.0.0.1, port 81, and Apache runs on port 80 of idontplaydarts.com. Both point to the same root directory, and when Apache sees a request for a file located in either wp-content or wp-includes it passes the request to Lighttpd. My config file looks something like this:

ProxyPass /wp-content http://127.0.0.1:81/wp-content
ProxyPassReverse /wp-content http://127.0.0.1:81/wp-content

ProxyPass /wp-includes http://127.0.0.1:81/wp-includes
ProxyPassReverse /wp-includes http://127.0.0.1:81/wp-includes

The one thing I have not switched to yet is the ProxyPassMatch directive. This would let me use a regular expression, such as one matching every .txt file, to tell Apache to pass all the requests for text files to Lighttpd.

ProxyPassMatch ^(/.*\.txt)$ http://127.0.0.1:81$1

According to the Apache documentation, ProxyPassMatch is available in Apache 2.2.5 and later, so the 2.2.13 install described above should already support it (2.2.13 is the newer release). That means I can change my configuration files to use regular expressions with the proxy rules.

It’s worth mentioning that you can do the proxying the other way round, with Lighttpd at the front passing requests to Apache, but there is not much benefit and you don’t get to take advantage of the nice Apache rewrite rules.