Referrer spam

You may have noticed in your web server logs that a few visitors seem to come from crappy websites; this is sometimes called referrer spam. It is profitable for the spammers for a couple of reasons:

  • Some people publish their statistics
    Dodgy webmasters are happy because they get more links to their sites. Some sites have added rel="nofollow" or things like that, but if the logs are published, so are the domain names/URLs.

  • Google Analytics
    Half of the Internet has voluntarily installed a Trojan horse on their websites. It's called "Google Analytics" and it basically sends all your visitors' footprints to Google (exceptions apply [1]). I cannot say how Google weighs this, but I am going to guess that it might still be profitable for our dodgy webmasters' website ranking.

Some have started a blacklist of domain names or even IPs that produce referrer spam. I think any blacklist, especially one like this, is really bad. For one, if a domain name blacklist became popular, the dodgy spammer could/would start referring competitors' websites. Public blacklists are bad. Always. If they are used, they can quickly get out of control (but that's another story).

For me, a simple idea is to maintain my own private list of dodgy domains that I notice in my logs and reject visitors that are referred by them. I'm sure it's not the most efficient thing ever, but who cares; it's mostly to clear the statistics a bit and have a laugh with those that fake referrers manually.

This is a sample of the code I have written; as you can see, it is extremely simple:

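    // List of spam referrer domains
    $bad_referrers = array(
        'dodgywebsite.tld',
        'someotherbadwebsitethatlikeslongdomainnames.tld',
    );

    // The Referer header is optional, so fall back to an empty string
    $referrer = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';

    // Check if the referrer is "bad"
    foreach ($bad_referrers as $v) {
        // preg_quote() escapes the dots in the domain name
        if (preg_match('#^https?://(w{3}\.)?' . preg_quote($v, '#') . '#', $referrer)) {
            // If so, redirect the visitor to their fake referrer.. LOL
            header("Location: http://$v");
            exit; // stop here so the rest of the page is not sent
        }
    }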

That should redirect bad referrers to the site they are referring to. The regular expression takes care of an optional "www." and of http or https, so you do not need to list the protocol or the standard www subdomain.
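The same matching rule can also be pulled into a small function, which makes it easier to try against a few sample referrers before putting it on a live site. This is only a sketch under some assumptions: the function name find_bad_referrer and the domains are made up for illustration, and it compares the parsed host name instead of using the regular expression, so a lookalike such as dodgywebsite.tld.example.com is not treated as a match.

```php
<?php
// Hypothetical helper, not part of the snippet above: returns the matching
// bad domain, or null when the referrer is clean, missing or unparsable.
function find_bad_referrer($referrer, array $bad_referrers)
{
    // parse_url() gives null when there is no host and false on malformed URLs
    $host = parse_url((string) $referrer, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return null;
    }

    // Host names are case-insensitive; drop an optional leading "www."
    $host = strtolower($host);
    if (strpos($host, 'www.') === 0) {
        $host = substr($host, 4);
    }

    foreach ($bad_referrers as $domain) {
        if ($host === strtolower($domain)) {
            return $domain;
        }
    }
    return null;
}

// Made-up list, for illustration only
$bad_referrers = array('dodgywebsite.tld');

// Matches, "www." and https are ignored: 'dodgywebsite.tld'
$hit = find_bad_referrer('https://www.dodgywebsite.tld/some/page', $bad_referrers);

// Different host that merely starts with the bad domain: null
$lookalike = find_bad_referrer('http://dodgywebsite.tld.example.com/', $bad_referrers);

// No referrer sent at all: null
$none = find_bad_referrer('', $bad_referrers);
```

With something like this in place, the redirect only needs to fire when the function returns a domain, and the returned value takes the place of $v in the Location header.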

[1] Google Analytics won't work for those who don't execute JavaScript or have a special hosts file that includes: