Who’s Reading Your Sitemap?

Is your sitemap being used against you by scrapers to steal your content?  Mine was, until I started taking steps to protect it.

[Image: A sitemap helps to get your pages indexed]

What is a Sitemap?

A sitemap, simply stated, is a file that lists every page on your website, making it easier for search engines to find and index your pages once you make it available to them. A sitemap can be a simple flat text file, but these days they are usually dynamically created and updated XML files.
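
For reference, a minimal XML sitemap looks something like this (this follows the sitemaps.org protocol; the URL and values are placeholders):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/</loc>
        <lastmod>2008-06-15</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
      </url>
    </urlset>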

Contrary to popular belief, a sitemap will not increase your search engine rankings, and it won’t get your site indexed any faster. A sitemap will, however, help to ensure that all of your pages get indexed by search engines. This is particularly important for pages that have no links pointing to them. Ensuring that all of your pages are indexed is an important piece of an overall SEO strategy, and although not 100% necessary, most SEO experts recommend submitting a sitemap to at least the Big-3 (Google, Yahoo, and MSN/Live).

Drupal has a great module to generate the XML sitemap for your site, and WordPress has a few plugins as well. If you have a static website, or if your CMS or blog does not have a plugin or module for creating a dynamic (always updated) sitemap, you can use a site like XML-Sitemaps.com to generate a static sitemap for you.

Once you have your sitemap, submitting it to the Big-3 (or others) is usually pretty easy. Each of them has a webmaster portal for submitting sitemaps: Google Webmaster Tools, Yahoo! Site Explorer, and Live Search Webmaster Tools. Some might require you to sign up for an account, but all are free. On most of them, after you sign up, they have diagnostic tools and other free services that can help with sitemap errors, as well as reports and information about your index status, search statistics, and more. Each has a slightly different process for submitting a sitemap, so be sure to read the instructions, but all are pretty easy.

You’ve submitted your sitemap, so what could go wrong?

One day while going through my logs, I saw that an IP not belonging to any search engine had downloaded my sitemap. I started watching my logs more closely, and I was seeing many non-search-engine IPs downloading my sitemap every day – oddly enough (not!), most of these IPs were from countries like India, China, and Russia – but they were basically coming from everywhere. After a little more investigation, I found these same IPs reading hundreds or even thousands of my pages – what I realized is that these “scrapers” had downloaded my sitemap, then used it to crawl through and copy every single page on the site. These scrapers were not only stealing my content for use on some MFA (made-for-AdSense) website, but they were also sucking up huge amounts of my (very expensive) bandwidth. This is probably common sense/common knowledge to experienced webmasters, but for me it was one of those “ahh-haa!” moments when I realized what was happening.

There are a few things I do to try and stop these bandwidth-sucking scraper-leeches. Back when we were still using Drupal 4.7, we used the Gsitemap module for Drupal. The Gsitemap module used a non-standard sitemap name instead of the standard domainname.com/sitemap.xml path. Just the fact that the sitemap name was non-standard apparently fooled many scrapers, so if possible, changing your sitemap name to something other than sitemap.xml will thwart many of them.
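
If your CMS can’t rename the file but you’re on Apache, you can fake it by hand: keep the real sitemap under an unguessable name, submit that name directly to the search engines, and return a 404 for the well-known path. A rough .htaccess sketch (this assumes mod_rewrite is enabled, and the renamed file, pages-7f3a.xml, is a made-up example):

    # Assumes Apache with mod_rewrite enabled
    RewriteEngine On
    # Scrapers probing the well-known name get a 404; the real sitemap
    # lives at a made-up name like /pages-7f3a.xml, submitted directly
    # to the search engines and never linked or listed anywhere
    RewriteRule ^sitemap\.xml$ - [R=404,L]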

Since we’ve upgraded to Drupal 5 and started using the newer XML Sitemap module, we can’t (easily) change the name of the sitemap, so we immediately saw a huge increase in sitemap downloads and site-scraping. To combat them, we keep an eye on the logs – the XML Sitemap module will record an entry each time the sitemap is downloaded, along with the domain name and IP that downloaded it. If it wasn’t an IP that belongs to a search engine, I use the Drupal Troll module to block that IP. For WordPress you could use the Ban plugin, and for any site you could also block the IP in .htaccess manually, or via cPanel/WHM if you have it.
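
The .htaccess route only takes a few lines. A sketch using Apache 2.2-style access control (the addresses below are documentation-range placeholders, not real scrapers):

    # Apache 2.2-style access control; the IPs are placeholders
    Order Allow,Deny
    Allow from all
    Deny from 203.0.113.45
    Deny from 198.51.100.0/24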

With Drupal you can also easily see scraping behavior by viewing “Top Visitors” in admin/logs. You can spot the scraper because it’s the IP that has 10,000 page views (or some other very large number) in the last xx hours. I verify that these IPs with unusually high page-reads do not belong to a legitimate search engine, then I ban that IP. I don’t really worry about banning a real visitor, because these scrapers have so many more page reads than a regular visitor that they give themselves away and stand out like a sore thumb.
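
Verifying an IP is a quick reverse-DNS lookup, followed by a forward lookup on the returned name to make sure it isn’t spoofed. From a shell it looks something like this (the output shown is illustrative, but legitimate Googlebot IPs really do resolve to hostnames under googlebot.com):

    $ host 66.249.66.1
    1.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-1.googlebot.com.
    $ host crawl-66-249-66-1.googlebot.com
    crawl-66-249-66-1.googlebot.com has address 66.249.66.1

A scraper’s IP will usually have no reverse record at all, or will resolve to some generic hosting name, and a forged reverse record won’t forward-resolve back to the same IP.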

I know that doing these things is a bit like spitting into the wind, or like trying to clean a beach one grain of sand at a time, but it does help – and it makes me feel like I have a bit more control over MY content, MY site, and MY bandwidth. YMMV.

6 thoughts on “Who’s Reading Your Sitemap?”

  1. Hi Randy, great post. Something that I have never thought of personally. Mostly because I guess my sites aren’t scrape worthy enough yet.

    But this is something that I will keep in mind once my sites do start growing and any advantage that we can get over the spammers definitely helps.

  2. “What logs would I be looking at?”

    If you’re using Drupal, it’s right there in the watchdog logs, and you can filter them so you only view the sitemap entries. Other CMSs probably have similar features.

    Otherwise, you should be able to look through your raw access logs and filter or search for your sitemap URL. Whenever I need to go through my raw logs for something, I download the log to my computer, then use my text editor to search for the string I’m looking for. So in this case I would just search for “sitemap.xml”.
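
    If you’re on a *nix box, a shell one-liner can do the counting for you too; something like this, assuming the usual Apache access log format where the IP is the first field:

        grep "sitemap.xml" access.log | awk '{print $1}' | sort | uniq -c | sort -rn | head

    That prints the IPs that fetched the sitemap most often, worst offenders first.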

  3. I found your reply to my post at DP, and I am glad I did.

    I will bookmark this post and spend some time reading it further.

    But in the meantime, thanks for this, as I was not yet aware of these sitemap issues.
