As a website owner, you probably have at least a few good reasons to block bots and scrapers. Scrapers steal your content, and unruly bots can do anything from eating your bandwidth to trying to hack into your site.
As a forum or community owner, you may also have reasons to block proxies. Proxies are what give many trolls, fakes, assholes, idiots, jerk-offs, and other pitiful people in general, their false bravado. For some reason, these “tech experts” that have the elite skills to be able to type the words “free proxy” into Google, or figure out how to install a TOR client, grow giant balls when they think you can’t track them down to their real IP address. Give this kind of anonymity to these socially unbalanced people (that’s a nice way of saying losers in real life, or people that forget to take their meds) and they suddenly become “tough guys” with no fear to wreak havoc in your community. BUT, take away their proxy, force them to log in from home or work, and they suddenly become able to follow the rules or, more likely, are too chicken to do or say anything and alas, they go away!
If they DO continue to insist on making themselves feel better (it’s sad, I know) by bullying or causing trouble in your online community, then one report to their ISP (or the FBI if they are REALLY going overboard) or employer will usually take care of it. Imagine what mommy and daddy will do when their internet account gets terminated! If they are adults (yes, sadly “adults” do pull this kind of shit), then they’ll have to deal with the hassle of getting a new ISP, or deal with mommy and daddy if they live with their parents in the basement (a common trait of internet trolls). If reporting them doesn’t help, you can ban their IP and have no worries that they’ll just come right back via a proxy. Sure, since you can never block 100% of the proxies out there, they may still find a proxy that works, but as your proxy-blocking skills grow, eventually it will become too much hassle for all but the most pitiful of trolls or assholes, and they’ll give up and go get their kicks bothering some other community.
So here are a few updated tips for blocking bots, scrapers, and proxies (aka trolls and assholes). Much of this is Drupal focused, but much can be applied to any website/blog/forum.
Start with the obvious: The Drupal Troll Module. The Drupal 5.x version of this module was abandoned several months ago after a critical security flaw was discovered. But after popular outcry it has been updated and is supported again. The Troll module allows you to block IP addresses and redirect them to a static HTML page, but it also allows you to search your member database by IP address or email address (very handy in some situations). It supports wildcard searching (just leave the last octet of an IP address blank, for example, and it will return all matches) so even tracking down trolls using DHCP is easy. The Troll module will also easily show you every IP address that a member has ever signed in with (User|Troll Track) and the domain name. A member using a legit IP will show a history from the same address or ISP, whereas someone using a proxy will show as coming from many different locations and domains. After you’ve looked at a few IP histories, the proxies will stand out like a sore thumb. You can then block those IPs using the Troll module or your IPTables firewall.
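If you'd rather not eyeball every history, the "many different locations" test can be roughed out in a few lines of Python. This is only a sketch of the idea, not anything the Troll module does: the function names, the /16 grouping, and the threshold of 4 are all my own assumptions, and the addresses are documentation placeholders.

```python
import ipaddress

def distinct_blocks(login_ips, prefix_len=16):
    """Group a member's sign-in IPs into the /prefix_len blocks they fall in."""
    return {
        ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        for ip in login_ips
    }

def looks_scattered(login_ips, threshold=4):
    """Rough heuristic (an assumption, tune it): True when the history
    touches at least `threshold` different /16 blocks."""
    return len(distinct_blocks(login_ips)) >= threshold

# Hypothetical histories, as you might copy them out of an IP history page.
home_user = ["192.0.2.10", "192.0.2.77", "192.0.2.200"]
proxy_hopper = ["192.0.2.1", "198.51.100.1", "203.0.113.1", "10.1.2.3"]

print(looks_scattered(home_user))
print(looks_scattered(proxy_hopper))
```

Treat a hit as a reason to look closer, not as proof. Mobile users and travelers also bounce between networks.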
Next on the list is Bad Behavior. If you use Drupal, you need to install the Drupal Bad Behavior module and the Bad Behavior script. If you use WordPress, you need only the script. Bad Behavior can also be modified to work with virtually any PHP-based website/forum. Bad Behavior blocks almost all automated bots, scrapers, and spammers – and if used in combination with something like Akismet or Mollom, spam becomes almost a non-issue. When put in “strict mode” Bad Behavior blocks many (but not all) proxies, and is a great first line of defense, but you can also use information from Bad Behavior with the CSF/IPTables firewall to locate proxy/server farms and block them en masse.
Now for the big guns: The IPTables Firewall. IPTables allows you to block individual IP addresses or CIDRs (entire ranges of IPs) from accessing your website/server, but instead of simply redirecting blocked addresses to a static page at the domain level like TROLL does, IPTables/CSF “drops” all the packets, leaving the troll/asshole/proxy user nothing but an “unable to connect” error. IPTables is very powerful, and almost by definition that makes it difficult to use. Because of that, I recommend using CSF Firewall, which is almost a GUI for IPTables and also adds some great additional features. To use IPTables/CSF you need either a VPS or dedicated server with root access. If you are on a shared host and have asshole problems, you might have to put your big-boy pants on and move to a dedicated or VPS server.
Once you get CSF up and running (it’s really not that tough), do the obvious things like activating the Real Time Block Lists (RBLs) and using the CC_DENY setting to block entire countries that you don’t need hanging around your site (North Korea, China, Turkey, Russia, India come to mind).
After you’ve blocked all the undesirable countries with CC_DENY, you can move on to the CSF.DENY file, which allows you to block IPs and ranges of IP addresses in CIDR format. The first thing you can do is import any IP addresses that you’ve already blocked with the TROLL module – then you can start building your proxy-blocking list.
In building your proxy-block list, you aren’t just blocking proxy servers; you really want to block all servers. There is really no reason for any server other than Google bots, Yahoo, etc., to access your site, so blocking any/all ‘server farms’ will protect you not only from assholes using proxies, but also from compromised servers trying to hack your site. The best source I have found for building my block list (now blocking hundreds of thousands of IPs and several million domains) is the Bad Behavior module (mentioned above). By learning how/why Bad Behavior blocks IPs you can identify servers and server farms and add them by the thousands to your CSF.DENY file.
What to look for in Bad Behavior: Each time Bad Behavior blocks an IP it logs the IP address and the reason. The following reasons often (not always, you have to be careful) mean that the originating IP belongs to a proxy or a server:
- Header ‘Connection’ contains invalid values
- Required header ‘Accept’ missing
- Prohibited header ‘Proxy-Connection’ present
- Header ‘Referer’ is corrupt
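If your Bad Behavior log is long, a little Python can pull out just the IPs worth a WHOIS lookup. This is a minimal sketch: the `(ip, reason)` pair format and the function name are assumptions for illustration, not Bad Behavior's actual log schema, so adapt the input side to however you export your log.

```python
import ipaddress

# The four reasons listed above, written with plain quotes.
SUSPECT_REASONS = frozenset({
    "Header 'Connection' contains invalid values",
    "Required header 'Accept' missing",
    "Prohibited header 'Proxy-Connection' present",
    "Header 'Referer' is corrupt",
})

def suspect_ips(entries):
    """Return the unique IPs, in numeric order, whose reason is suspect.

    `entries` is an iterable of (ip, reason) pairs (a hypothetical shape).
    Curly quotes in the reason are normalized before comparing.
    """
    found = set()
    for ip, reason in entries:
        plain = reason.replace("\u2018", "'").replace("\u2019", "'").strip()
        if plain in SUSPECT_REASONS:
            found.add(ipaddress.ip_address(ip))
    # Sort by (version, value) so IPv4 and IPv6 can share one list.
    ordered = sorted(found, key=lambda a: (a.version, int(a)))
    return [str(a) for a in ordered]

# Placeholder log rows using documentation addresses.
log = [
    ("198.51.100.7", "Prohibited header 'Proxy-Connection' present"),
    ("192.0.2.10", "Required header 'Accept' missing"),
    ("192.0.2.11", "Some other reason"),
]
print(suspect_ips(log))  # ['192.0.2.10', '198.51.100.7']
```

The output is a list of candidates, not a block list. Each one still needs the manual WHOIS check described next.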
Take an IP address that Bad Behavior logged with one of the reasons above and do a quick WHOIS lookup on it. I like to use http://whois.domaintools.com, but any WHOIS server will do. Usually (not always) a server or proxy will show other sites listed, an SSL cert, etc. For example, look at this WHOIS for 18.104.22.168. A WHOIS lookup for a regular home ISP connection, or a business, won’t show much info at all; for example, look at this WHOIS for this Comcast home user.
So now you have your IP, in our example above, 22.214.171.124, but you don’t want to block just that IP; you want to block every server in that entire IP range. To do that, you add the CIDR to your CSF.DENY file in CSF. The example server/proxy above has the following CIDR in its WHOIS info:
Address: 141 w jackson blvd.
Address: suite #1135
NetRange: 126.96.36.199 - 188.8.131.52
CIDR: 184.108.40.206/18 <-------------- This is the CIDR
NetType: Direct Allocation
If you aren’t positive this is a server farm, you could visit the domain listed, in this case, FDCservers.net. Their website clearly shows that they are a server hosting company. You could also google the company name or even the IP to dig up more info. Now that you are positive that you want to block this entire range or CIDR of 220.127.116.11/18, simply add it to your CSF.DENY. Sometimes, usually with foreign servers, a CIDR won’t be listed. In a case like that you can still block an entire range of IPs by using a CIDR calculator and entering the beginning IP address and the mask or range/number of IPs to block. I usually block an entire 16-bit range, which for the example above would be 67.159.0.0/16 instead of the “/18” CIDR above. The “/18” applies only to FDCServers, while “/16” blocks everything that starts with 67.159.
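If you don't have a CIDR calculator handy, Python's standard `ipaddress` module does the same arithmetic. The sketch below shows both cases: widening a single address to its enclosing /16, and turning a bare start/end range (when WHOIS lists no CIDR) into CIDR blocks. The helper names and the host address `67.159.10.20` are made up for illustration, and the second example uses a documentation range.

```python
import ipaddress

def enclosing_block(ip, prefix_len=16):
    """Return the /prefix_len network that contains `ip`."""
    # strict=False lets us pass a host address and have the host bits zeroed.
    return ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)

def range_to_cidrs(first, last):
    """Cover the inclusive range first..last with the fewest CIDR blocks."""
    return list(ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)))

print(enclosing_block("67.159.10.20"))                      # 67.159.0.0/16
print(enclosing_block("67.159.10.20", 16).num_addresses)    # 65536
print([str(n) for n in range_to_cidrs("192.0.2.0", "192.0.2.127")])
# ['192.0.2.0/25']
```

A /18 holds 16,384 addresses and a /16 holds 65,536, so going from /18 to /16 quadruples what you block. That is the trade-off behind the occasional legitimate user getting caught.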
When adding your IPs or CIDRs to CSF.DENY, be sure to add “# do not delete” after each entry. Otherwise, once you hit the limit of IPs specified in your CSF configuration file, older entries will get overwritten with newer entries.
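Typing that comment by hand gets old once the list is long. Here is a small sketch that turns a list of IPs and CIDRs into lines of the form described above, validating and de-duplicating along the way. The function name and the optional `note` argument are my own inventions, and the addresses are placeholders.

```python
import ipaddress

def deny_lines(targets, note=""):
    """Format IPs/CIDRs as 'address # do not delete' lines, skipping repeats."""
    lines, seen = [], set()
    for target in targets:
        # Raises ValueError on garbage input, which is what we want here.
        net = ipaddress.ip_network(target.strip(), strict=False)
        # Write single hosts as a bare IP, everything else as a CIDR.
        if net.prefixlen == net.max_prefixlen:
            text = str(net.network_address)
        else:
            text = str(net)
        if text in seen:
            continue
        seen.add(text)
        comment = "do not delete" + (f" - {note}" if note else "")
        lines.append(f"{text} # {comment}")
    return lines

for line in deny_lines(["192.0.2.5", "198.51.100.77/24", "192.0.2.5"]):
    print(line)
# 192.0.2.5 # do not delete
# 198.51.100.0/24 # do not delete
```

Paste the output into CSF.DENY yourself rather than letting a script write to the live file, at least until you trust it.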
How to block TOR: The Onion Router, or TOR, is a network of proxies intended to protect the anonymity of internet users. TOR is great for whistleblowers or government protesters, but not so great for website owners trying to keep assholes out of their community. TOR is fairly easily blocked by adding the list of “TOR exit nodes” into CSF.DENY or TROLL. You can get an updated list of TOR exit nodes here: TOR Exit Node list. TOR is dynamic and the list changes, so you’ll have to update it every few days or so.
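Since the exit-node list keeps changing, each refresh is really a diff: some nodes are new and need blocking, others are gone and can come off your list. A rough sketch of that comparison follows. It assumes you have saved the old and new lists as plain text with one IP per line, which may not match the format of whatever list you download; the function names are hypothetical.

```python
import ipaddress

def parse_ip_list(text):
    """Parse one-IP-per-line text, ignoring blanks, '#' comments and junk."""
    ips = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        try:
            ips.add(ipaddress.ip_address(line))
        except ValueError:
            continue  # not an IP address, skip it
    return ips

def refresh_plan(old_text, new_text):
    """Return (to_add, to_remove) as lists of IP strings."""
    old, new = parse_ip_list(old_text), parse_ip_list(new_text)
    order = lambda a: (a.version, int(a))  # keeps IPv4/IPv6 sortable together
    to_add = [str(a) for a in sorted(new - old, key=order)]
    to_remove = [str(a) for a in sorted(old - new, key=order)]
    return to_add, to_remove

last_week = "192.0.2.1\n192.0.2.2\n"
today = "192.0.2.2\n# header line\n198.51.100.9\n"
print(refresh_plan(last_week, today))
# (['198.51.100.9'], ['192.0.2.1'])
```

Removing stale exit nodes matters as much as adding new ones, because an address that stops being an exit node may belong to an ordinary user.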
How to block Port Proxies or SOCKS proxies: Port or SOCKS proxies are almost always blocked by Bad Behavior.
Sometimes you may end up blocking legitimate users, particularly when blocking entire ranges of IPs – it’s unavoidable. When someone complains, confirm their IP address and just remove them from CSF.DENY or your TROLL list – no big deal. I’ve been using these methods for over a year and I’ve only blocked 10 or so legitimate users (that I know of at least).
If you don’t have/can’t use IPTables/CSF, you can also use some of the techniques above to block IPs and CIDRs in your .htaccess file, but I cannot vouch for how well it will perform when the list grows large – and to be effective it needs to be really, really large.
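One way to keep an .htaccess list from ballooning is to merge overlapping and adjacent ranges before writing the rules. The sketch below does that with Python's `ipaddress.collapse_addresses` and emits the older `Order`/`Allow`/`Deny` directives. That syntax is an assumption about your server: it is the Apache 2.2 style, and newer Apache setups may want `Require`-based rules instead. The function name is hypothetical.

```python
import ipaddress

def htaccess_rules(targets):
    """Merge IPs/CIDRs and return Apache 2.2-style deny rules as lines."""
    nets = [ipaddress.ip_network(t.strip(), strict=False) for t in targets]
    merged = []
    # collapse_addresses cannot mix IPv4 and IPv6, so handle each separately.
    for version in (4, 6):
        same = [n for n in nets if n.version == version]
        merged.extend(ipaddress.collapse_addresses(same))
    rules = ["Order Allow,Deny", "Allow from all"]
    rules += [f"Deny from {net}" for net in merged]
    return rules

# Two adjacent /25s plus a host inside them collapse to one /24 rule.
print("\n".join(htaccess_rules(["192.0.2.0/25", "192.0.2.128/25", "192.0.2.5"])))
# Order Allow,Deny
# Allow from all
# Deny from 192.0.2.0/24
```

Fewer, wider rules mean less for the web server to check on every request, though I have no measurements on how far that stretches.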
This has turned out to be one of my longest and most rambling posts. If I’ve been unclear or if you have any questions, please post a comment. And oh – if you’re reading this via a proxy, post a comment and tell me that my techniques don’t work!