The robots.txt file is a standard way to request that specific bots not scrape your site. But some bots are said to ignore that request and scrape anyway.
To avoid being scraped by those bots anyway, smaller webservers may wish to resort to more creative measures.
Assumptions
This document assumes you’re serving your site using Caddy 2.x, and configuring Caddy using a Caddyfile. For example, my Caddyfile looks something like this:
average.name {
	reverse_proxy :8080
}
In short, this tells Caddy to enforce HTTPS for the average.name domain, and to manage a reverse proxy to another local HTTP webserver running adjacent to Caddy on port 8080. There is other reverse proxy software out there that can do the same thing, but I use Caddy and so does this document.
Be sure to use your own domain name in place of average.name throughout this tutorial.
Step 0: Have a robots.txt file
The first step, of course, is to politely request that certain bots not scrape your site. If they respect that request, then your webserver can avoid doing some extra work!
My site serves a robots.txt file that borrows heavily from the one at seirdy.one. (You might consider borrowing from Codeberg’s robust one as well.) I would like bots to respect this file. Unfortunately, some are known not to do that. So, as a fallback, we’ll take a more heavy-handed approach.
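For reference, a robots.txt is just a plain text file served at the root of your site. A rough sketch of the shape such a file takes (GPTBot is one bot from the list in the next step, and the /api/* rule mirrors the excerpt shown later in this post):

	# Served at https://average.name/robots.txt
	User-agent: GPTBot
	Disallow: /

	User-agent: *
	Disallow: /api/*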
Step 1: Define a Regular Expression (Regex) that lists the bad bots
This expression is constructed from the list of User-Agent entries in my robots.txt file which have Disallow: / set:
Adsbot|peer39_crawler|TurnitinBot|NPBot|SlySearch|BLEXBot|CheckMarkNetwork|BrandVerity|PiplBot|MJ12bot|ChatGPT-User|GPTBot|Google-Extended|Applebot-Extended|Claude-Web|anthropic-ai|ClaudeBot|FacebookBot|meta-externalagent|AI2Bot|Amazonbot|Bytespider|cohere-ai|Diffbot|facebookexternalhit|FriendlyCrawler|ICC-Crawler|ImagesiftBot|img2dataset|OAI-SearchBot|Omgili|Omgilibot|PerplexityBot|PetalBot|Scrapy|Timpibot|VelenPublicWebCrawler|YouBot
These I’ve asked politely in robots.txt not to crawl my site at all. If they proceed anyway, we’ll have a special treat for them >:3
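To be clear about how the expression maps back to robots.txt: each name in the alternation corresponds to a block that disallows everything, so disallowing one more bot (a hypothetical ExampleBot, say) means adding it in both places:

	# In robots.txt: ask the new bot to stay away entirely
	User-agent: ExampleBot
	Disallow: /

	# In the regex: append the same name as one more alternative
	...|VelenPublicWebCrawler|YouBot|ExampleBot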
Step 2: Define a matcher
In your Caddyfile, inside your site block, construct a named matcher. Use header_regexp to match requests whose User-Agent header matches your regex from Step 1. The matcher should also exclude the /robots.txt path specifically, as we still want to serve our polite request to bad bots.
average.name {
	@badrobots {
		# Bots that self-report with one of these User-Agent strings are matched:
		header_regexp User-Agent Adsbot|peer39_crawler|TurnitinBot|NPBot|SlySearch|BLEXBot|CheckMarkNetwork|BrandVerity|PiplBot|MJ12bot|ChatGPT-User|GPTBot|Google-Extended|Applebot-Extended|Claude-Web|anthropic-ai|ClaudeBot|FacebookBot|meta-externalagent|AI2Bot|Amazonbot|Bytespider|cohere-ai|Diffbot|facebookexternalhit|FriendlyCrawler|ICC-Crawler|ImagesiftBot|img2dataset|OAI-SearchBot|Omgili|Omgilibot|PerplexityBot|PetalBot|Scrapy|Timpibot|VelenPublicWebCrawler|YouBot
		# The matcher never catches requests for robots.txt:
		not path /robots.txt
	}
	# ...
}
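Before moving on, it can be worth asking Caddy to check that the file still parses. The path below is only an example; point it at wherever your Caddyfile actually lives:

	# Validate the configuration without reloading the running server
	caddy validate --config /etc/caddy/Caddyfile --adapter caddyfile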
Step 3: Define behavior for bad bots
Now, use the matcher somewhere. This example uses the respond directive to tell Caddy to serve only the string :3 to bad bots.
average.name {
	@badrobots {
		# Defined in Step 2...
	}
	respond @badrobots ":3"
	# ...
}
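By default, respond sends the body with a 200 status. If you’d rather signal refusal outright, the directive also accepts a status code; the 403 here is just one reasonable choice:

	# The same treat, delivered with a 403 Forbidden status instead of 200
	respond @badrobots ":3" 403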
Alternatively, you might consider using the redir directive to redirect bots to some very large file hosted elsewhere. It’s up to you what you do.
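For example, swapping the respond line for something like this (the target URL is a placeholder; aim it wherever you like):

	# Redirect bad bots somewhere else entirely
	redir @badrobots https://example.com/some-very-large-file.bin 302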
Result
If you’ve configured Caddy correctly, then normal users will get the normal website:
curl https://average.name/
<!DOCTYPE html>
...
And bots will get silliness:
curl https://average.name/ -A "GPTBot"
:3
These bots may avoid silliness by reading and respecting your robots.txt file:
curl https://average.name/robots.txt -A "GPTBot"
User-agent: *
Disallow: /api/*
...
Disclaimers
Unfortunately, this method only works when scrapers self-report a consistent, recognizable User-Agent string. Some sneaky ones might send a different string, or not send one at all.
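If you want to catch that last case too, Caddy’s header matcher can negate on a header’s presence, so a second matcher can pick up requests that send no User-Agent at all. A sketch (the @noagent name is my own, and whether to block these is a judgment call, since some legitimate clients omit the header):

	@noagent {
		# Matches requests that carry no User-Agent header at all
		header !User-Agent
		not path /robots.txt
	}
	respond @noagent ":3"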
Reusable Snippet
If your Caddyfile defines multiple websites, you might consider wrapping your bot-blocking logic in a snippet and reusing it with import, rather than defining the matcher in each server block:
# Robots that ignore robots.txt get a fun treat :3
(block_bad_bots) {
	@badrobots {
		# We ask these bots in robots.txt not to proceed
		header_regexp User-Agent Adsbot|peer39_crawler|TurnitinBot|NPBot|SlySearch|BLEXBot|CheckMarkNetwork|BrandVerity|PiplBot|MJ12bot|ChatGPT-User|GPTBot|Google-Extended|Applebot-Extended|Claude-Web|anthropic-ai|ClaudeBot|FacebookBot|meta-externalagent|AI2Bot|Amazonbot|Bytespider|cohere-ai|Diffbot|facebookexternalhit|FriendlyCrawler|ICC-Crawler|ImagesiftBot|img2dataset|OAI-SearchBot|Omgili|Omgilibot|PerplexityBot|PetalBot|Scrapy|Timpibot|VelenPublicWebCrawler|YouBot
		# Always send robots.txt, even to bad bots
		not path /robots.txt
	}
	respond @badrobots ":3"
}

average.name {
	import block_bad_bots
	reverse_proxy :8080
}
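The same import line then works in any other site block you define; for example, alongside a second (made-up) site served from disk:

other-site.example {
	import block_bad_bots
	root * /srv/other-site
	file_server
}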
For best results, be sure to only do this to bots that are actually named in all of your webservers’ robots.txt files; otherwise your webserver will be rude to nice bots and do extra work!