Mathieu Brunot

Published May 21, 2012

Web crawlers: what are they and how to protect your site from them

Introduction

Indexing robots (also called web wanderers, web crawlers, or web spiders) are programs that, as their name suggests, index the Web. These programs catalogue the content of your site: text, images, addresses (URLs), and so on. This is how Google indexes the Web.

For various reasons (saving bandwidth, keeping “private” data out of search results, and so on), you may not want these robots to crawl your site. We will therefore see how to block them, or at least restrict their field of action.

robots.txt

Before indexing your site, well-behaved robots read a file called robots.txt. This file lets you specify which robots may or may not index which parts of your site. It must be placed at the root of your site (for example, https://example.com/robots.txt). It is a simple text file that you can create with any text editor (Notepad, Notepad++, vim, etc.).

The robots.txt file contains two essential directives:

  1. One to identify robots: User-agent
  2. One to exclude a particular folder from indexing: Disallow

A third directive, Allow, is interpreted by some robots (including Google's) and explicitly authorizes a robot to index part of the site.

Examples

Prevent all crawlers from indexing your entire site:

User-agent: *
Disallow: /

Prevent all crawlers from indexing your site's private folder:

User-agent: *
Disallow: /private/

Block the Google Image crawler from indexing your site:

User-agent: Googlebot-Image
Disallow: /

Block all bots except Google:

User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
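
To check how a compliant crawler interprets such rules, you can test them with Python's standard urllib.robotparser module. Here is a minimal sketch using the rules from the last example; example.com is a placeholder for your own site, and SomeBot stands for any non-Google crawler:

from urllib.robotparser import RobotFileParser

# The "block all bots except Google" rules from the example above.
rules = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Ask whether a given user agent may fetch a given URL.
print(rp.can_fetch("SomeBot", "https://example.com/page.html"))    # False: caught by the catch-all group
print(rp.can_fetch("Googlebot", "https://example.com/page.html"))  # True: the Googlebot group allows it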

Meta tag robots

Another way to proceed is to place, in the head of the HTML page (<head>), a <meta> tag dedicated to robots. The two methods (robots.txt and the meta tag) can be used together.

<meta name="robots" content="noindex,nofollow" />

The tag's content attribute accepts two directives:

Value        Description
(no)index    Whether the page itself is indexed. Possible values: index and noindex.
(no)follow   Whether the links on the page are followed. Possible values: follow and nofollow.

Examples

<meta name="robots" content="noindex,nofollow" />
<meta name="robots" content="index,nofollow" />
<meta name="robots" content="noindex,follow" />
<meta name="robots" content="index,follow" />

Bad Bots

However, not all robots are benevolent. Bad robots, or bad bots, ignore your instructions and harvest information from your site for dubious purposes. It is therefore useful to know how to ban these bad bots.

On an Apache server, this can be done through the .htaccess configuration file. Using mod_rewrite, you can refuse requests from clients whose User-Agent header matches a known bad bot.

Here is a proposed .htaccess, taken from “A Close to perfect .htaccess ban list”, which blocks most known bad bots:

# Blocking bad bots and site rippers
<IfModule rewrite_module>
    RewriteEngine on
    RewriteBase /
    RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
    RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
    RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
    RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
    RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
    RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
    RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
    RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
    RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
    RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
    RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
    RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
    RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
    RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
    RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
    RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
    RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
    RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
    RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
    RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
    RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
    RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
    RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
    RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
    RewriteCond %{HTTP_USER_AGENT} ^Zeus
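    # Any request matching one of the conditions above is refused
    # with 403 Forbidden ([F]) and rewriting stops ([L]).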
    RewriteRule ^.* - [F,L]
</IfModule>
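
To verify that the ban works, you can request a page while presenting one of the blocked user agents; the server should answer 403 Forbidden. Here is a minimal sketch with Python's standard urllib; example.com is a placeholder for your own site:

import urllib.request
import urllib.error

# Present a banned client name ("Wget" matches the ^Wget condition above).
req = urllib.request.Request(
    "https://example.com/",  # placeholder: replace with your own site
    headers={"User-Agent": "Wget"},
)

try:
    urllib.request.urlopen(req)
    print("request succeeded: the ban is not active")
except urllib.error.HTTPError as err:
    print("server answered", err.code)  # expect 403 if the ban works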