Block AI Scrapers with ingress-nginx
March 03, 2024
robots.txt is a long-established standard to control well-behaved[1] web indexers (such as Google) for a site. Additionally, the <meta name="robots"> HTML tag and the X-Robots-Tag HTTP header serve a similar role to robots.txt, but for a specific page or request.
Since I use ingress-nginx for routing on my home Kubernetes cluster, the X-Robots-Tag header is the most straightforward way I came up with to ensure that any content I'm serving is opted out of ML training.
This same approach should work with Traefik, Caddy, and most other common HTTP servers.
ingress-nginx has a built-in add-headers option to set[2] a header on all outbound responses.
The Helm chart makes this straightforward with a small addition to values.yaml:
# ingress-nginx :: values.yaml
controller:
  addHeaders:
    X-Robots-Tag: noai
This is probably good enough and works with pretty much any version of ingress-nginx, so feel free to stop here.
I have trust issues...
It's not clear to me who is actually respecting the generic noai rule, so I want to additionally include specific rules for the known major AI crawlers.
Thankfully, the X-Robots-Tag header can be repeated to provide multiple rules, so what I really want is:
x-robots-tag: noai
x-robots-tag: Google-Extended: none
x-robots-tag: GPTBot: none
x-robots-tag: ChatGPT-User: none
x-robots-tag: anthropic-ai: none
x-robots-tag: CCBot: none
[This list was last updated March 3, 2024.]
ingress-nginx uses headers-more-nginx-module, which added support for appending headers with more_set_headers -a in v0.36[3].
Unfortunately, we can't use ingress-nginx's wrappers (add-headers / addHeaders), since they only support a single value per header.
All is not lost, however! You do need to be running ingress-nginx ≥ 1.10.0.
Using a custom nginx config snippet, we can call the nginx module directly:
# ingress-nginx :: values.yaml
controller:
  config:
    http-snippet: |
      more_set_headers -a "X-Robots-Tag: noai";
      more_set_headers -a "X-Robots-Tag: Google-Extended: none";
      more_set_headers -a "X-Robots-Tag: GPTBot: none";
      more_set_headers -a "X-Robots-Tag: ChatGPT-User: none";
      more_set_headers -a "X-Robots-Tag: anthropic-ai: none";
      more_set_headers -a "X-Robots-Tag: CCBot: none";
[This list was last updated March 3, 2024.]
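Since the crawler list is bound to change, one option is to generate the snippet lines instead of editing them by hand. The following is a minimal Python sketch, not something the setup above depends on: `AI_CRAWLERS` and `render_http_snippet` are hypothetical names, and the output simply mirrors the directives listed above.

```python
# Hypothetical helper: renders the more_set_headers directives shown above
# from a list of crawler user-agent tokens.
AI_CRAWLERS = ["Google-Extended", "GPTBot", "ChatGPT-User", "anthropic-ai", "CCBot"]


def render_http_snippet(crawlers):
    """Return one directive per rule: the generic noai rule first, then each crawler."""
    rules = ["noai"] + [f"{bot}: none" for bot in crawlers]
    return "\n".join(
        f'more_set_headers -a "X-Robots-Tag: {rule}";' for rule in rules
    )


if __name__ == "__main__":
    # Paste the printed lines under http-snippet in values.yaml.
    print(render_http_snippet(AI_CRAWLERS))
```

Adding a new crawler then means appending one token to the list and re-rendering.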
Once deployed:
❯ curl -sS -D - https://ping.readygo.run/ -o /dev/null
HTTP/2 200
date: Sun, 03 Mar 2024 19:46:51 GMT
content-type: application/json; charset=utf-8
content-length: 393
x-content-type-options: nosniff
x-robots-tag: noai
x-robots-tag: Google-Extended: none
x-robots-tag: GPTBot: none
x-robots-tag: ChatGPT-User: none
x-robots-tag: anthropic-ai: none
x-robots-tag: CCBot: none
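Rather than eyeballing this output after every upgrade, the check could be scripted. This is a rough Python sketch under assumptions: `x_robots_values` and `missing_rules` are hypothetical helpers, and the input is the raw header text that the curl command above prints.

```python
EXPECTED_RULES = [
    "noai",
    "Google-Extended: none",
    "GPTBot: none",
    "ChatGPT-User: none",
    "anthropic-ai: none",
    "CCBot: none",
]


def x_robots_values(raw_headers):
    """Collect every X-Robots-Tag value from raw response header text."""
    values = []
    for line in raw_headers.splitlines():
        # Split on the first colon only, since rule values contain colons too.
        name, sep, value = line.partition(":")
        # Header names are case-insensitive; HTTP/2 output is lowercase.
        if sep and name.strip().lower() == "x-robots-tag":
            values.append(value.strip())
    return values


def missing_rules(raw_headers, expected=EXPECTED_RULES):
    """Return the expected rules that do not appear in the response headers."""
    present = set(x_robots_values(raw_headers))
    return [rule for rule in expected if rule not in present]
```

Feeding it the header dump above should yield an empty list; any rule that comes back is one the ingress did not emit.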
Smell ya later, 🤖s!
All configuration examples on this page are marked as CC0[4].
[1] This qualifier is important: it's a request for the other party to respect the opt-out.
[2] Despite the naming, this option will overwrite (not append) any existing header of the same name in the response.
[3] Thanks to the headers-more-nginx-module maintainers for their prompt response to my feature request ❤️
[4] I don't think they're copyrightable anyway, but I'm not a lawyer 🤷