Block website crawlers and bots in Apache2

Posted by ads on Sunday, 2021-03-21
Posted in [Online][Software]

Found a couple more bots crawling my website, and judging by what online resources say about them, it seems I caught a few of the bad guys: crawlers which ignore the robots.txt standard and just crawl a website for content.

Decided to do something about it, and added a filter in Apache2.

The way I have my webserver set up is that every website has its own template (they all have different configs), and I deploy them using Ansible. Parts of the website configuration which are the same, or at least similar, are handled by includes.

Step 1: Identify bots

The first step is to find the bots in the logfiles. Every website writes into its own logfile here. The following line aggregates all of the current logfiles, extracts the user agent string, and sorts by number of occurrences:

cd /var/log/apache2
awk -F\" '($2 ~ "^GET /"){print $6}' *.log | sort | uniq -c | sort -n | tac | less

At the top of the list you find the most common user agents. Keep in mind that some crawlers might hide behind a regular user agent string.
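To see why field 6 is the user agent, it helps to run the same awk expression on a single line. This is a small sketch which assumes the logs are in Apache's "combined" format; the sample line is the one from the end of this post. Splitting on double quotes makes field 2 the request line, field 4 the referer and field 6 the user agent.

```shell
# One sample log line, assumed to be in "combined" format
# (taken from the log excerpt at the end of this post).
line='- - [21/Mar/2021:16:44:59 +0100] "GET /blog/archives/2018/12/C67.html HTTP/1.1" 403 4668 "-" "Mozilla/5.0 (compatible; BOT4/1.0)"'

# With the double quote as field separator:
#   $2 = request line, $4 = referer, $6 = user agent
# Only GET requests are kept, same as in the one-liner above.
ua=$(printf '%s\n' "$line" | awk -F\" '($2 ~ "^GET /"){print $6}')
echo "$ua"
```

If your LogFormat differs, the user agent may end up in a different field, so check one line by hand first.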

Step 2: Create filter rules

Extract the common pattern from the bots you want to block. Usually that’s the name of the bot. Create RewriteCond rules:

RewriteCond "%{HTTP_USER_AGENT}" "^.*BOT1.*$" [OR]
RewriteCond "%{HTTP_USER_AGENT}" "^.*BOT2.*$" [OR]
RewriteCond "%{HTTP_USER_AGENT}" "^.*BOT3.*$" [OR]
RewriteCond "%{HTTP_USER_AGENT}" "^.*BOT4.*$"

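The [OR] flag at the end of every line, except the last one, tells Apache to combine that condition with the next one using OR, so the rule fires if any one of the patterns matches. Without the flag, Apache combines the conditions with AND, and all of them would have to match.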
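Before deploying, the patterns can be tried against a list of user agents on the command line. The following is only a sketch: the three user agent strings are made-up samples, and grep -E stands in for Apache's regex engine, which is fine for plain substrings like these. The OR chain behaves like a single alternation, and since ^.*BOT1.*$ matches exactly the same strings as a bare BOT1, the short form is used here.

```shell
# Hypothetical sample of user agent strings, one per line.
agents='Mozilla/5.0 (compatible; BOT4/1.0)
Mozilla/5.0 (X11; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0
BOT1 crawler'

# The alternation mirrors the [OR] chain: a line is selected
# if any one of the names appears anywhere in it (case-sensitive).
blocked=$(printf '%s\n' "$agents" | grep -E 'BOT1|BOT2|BOT3|BOT4')
echo "$blocked"
```

Feeding the real output from Step 1 through the same grep shows which user agents the rules would hit, and whether a regular browser is caught by accident.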

And finally decide what to do with the requests from the bots:

RewriteRule "." "-" [R=403,L]

In my case I just block the request. More subtle options are also possible.

Step 3: Deploy everything

As mentioned before, configs are automated here. In the directory where I have the webserver configs I created an additional file robots.j2. The .j2 ending is the common ending for Jinja2 templates, but the prefix does not really matter. The content of the file:

	# block some robots
	RewriteEngine On
	RewriteCond "%{HTTP_USER_AGENT}" "^.*BOT1.*$" [OR]
	RewriteCond "%{HTTP_USER_AGENT}" "^.*BOT2.*$" [OR]
	RewriteCond "%{HTTP_USER_AGENT}" "^.*BOT3.*$" [OR]
	RewriteCond "%{HTTP_USER_AGENT}" "^.*BOT4.*$"
	RewriteRule "." "-" [R=403,L]

And in my webserver configuration files I include the following line:

{% include 'robots.j2' %}

The files are rolled out using the Ansible template module, and during deployment the robots.j2 is included into each config file.

After changing the files the webserver needs to be restarted (just the Apache2 service, not the entire server). That’s managed by a notify handler in my Playbook.

By using an external include file I only need to change this one file, and then re-deploy the configuration. Another way to do this is to use the Include directive in Apache2. However, I decided against it: by using a template in Ansible I keep the door open for more elaborate configurations, for example on a per-webserver basis.

Step 4: Profit

- - [21/Mar/2021:16:44:59 +0100] "GET /blog/archives/2018/12/C67.html HTTP/1.1" 403 4668 "-" "Mozilla/5.0 (compatible; BOT4/1.0)"
- - [21/Mar/2021:16:45:03 +0100] "GET /blog/archives/2018/12/C67/summary.html HTTP/1.1" 403 4668 "-" "Mozilla/5.0 (compatible; BOT4/1.0)"
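Instead of waiting for a bot to show up in the logs, the rule can also be checked by hand with a spoofed user agent. A minimal sketch, assuming curl is installed; the function name status_for_agent and the URL www.example.com are placeholders and not part of my setup.

```shell
# Print only the HTTP status code the server returns for a given user agent.
#   $1: user agent string to send
#   $2: URL to request
status_for_agent() {
    curl --silent --output /dev/null --write-out '%{http_code}' \
        --user-agent "$1" "$2"
}

# Usage (replace the placeholder URL with your own site):
#   status_for_agent 'Mozilla/5.0 (compatible; BOT4/1.0)' 'https://www.example.com/'
#   status_for_agent 'Mozilla/5.0 (X11; Linux x86_64)' 'https://www.example.com/'
```

With the rules active, the first call should report 403, while the second one should get whatever status the page normally returns.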
