Block AI Scrapers with ingress-nginx

March 03, 2024 - kubernetes homelab machine-learning

robots.txt is a long-established standard for controlling how well-behaved [1] web indexers (such as Google) crawl a site.

Additionally, the <meta name="robots"> HTML tag and the X-Robots-Tag HTTP header serve a similar role to robots.txt, but for a specific page or response.

Since I use ingress-nginx for routing on my home Kubernetes cluster, the X-Robots-Tag header is the most straightforward way I came up with to ensure that any content I'm serving is opted out of ML training.

This same approach should work with Traefik, Caddy, and most other common HTTP servers.

ingress-nginx has a built-in add-headers option to set [2] a header on all outbound responses.

The Helm chart makes this straightforward with a small addition to values.yaml:

# ingress-nginx :: values.yaml
controller:
  addHeaders:
    X-Robots-Tag: noai

This is probably good enough and works with pretty much any version of ingress-nginx, so feel free to stop here.


I have trust issues...

It's not clear to me who is actually respecting the generic noai rule, so I want to additionally include specific rules for the known major AI crawlers.

Thankfully, the X-Robots-Tag header can be repeated to provide multiple rules, so what I really want is:

x-robots-tag: noai
x-robots-tag: Google-Extended: none
x-robots-tag: GPTBot: none
x-robots-tag: ChatGPT-User: none
x-robots-tag: anthropic-ai: none
x-robots-tag: CCBot: none

[This list was last updated March 3, 2024.]

ingress-nginx uses headers-more-nginx-module, which added support for appending headers with more_set_headers -a in v0.36 [3].

Unfortunately, we can't use ingress-nginx's wrappers (add-headers / addHeaders), since they only support a single value per header.

All is not lost, however! You do need to be running ingress-nginx ≥ 1.10.0.

Using a custom nginx config snippet, we can call the nginx module directly:

# ingress-nginx :: values.yaml
controller:
  config:
    http-snippet: |
      more_set_headers -a "X-Robots-Tag: noai";
      more_set_headers -a "X-Robots-Tag: Google-Extended: none";
      more_set_headers -a "X-Robots-Tag: GPTBot: none";
      more_set_headers -a "X-Robots-Tag: ChatGPT-User: none";
      more_set_headers -a "X-Robots-Tag: anthropic-ai: none";
      more_set_headers -a "X-Robots-Tag: CCBot: none";

[This list was last updated March 3, 2024.]
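
Since the crawler list will keep changing, it may be easier to generate these lines than to hand-edit them. Here's a minimal sketch, assuming a POSIX shell. render_robots_snippet is a hypothetical helper name (it isn't part of ingress-nginx or Helm), and it only prints the directives so they can be pasted under http-snippet:

```shell
# Hypothetical helper: print one more_set_headers directive per crawler token.
# The generic "noai" rule is always emitted first; each argument becomes
# an "<token>: none" rule.
render_robots_snippet() {
  printf 'more_set_headers -a "X-Robots-Tag: noai";\n'
  for rrs_bot in "$@"; do
    printf 'more_set_headers -a "X-Robots-Tag: %s: none";\n' "$rrs_bot"
  done
}

# Example: regenerate the list used in this post.
render_robots_snippet Google-Extended GPTBot ChatGPT-User anthropic-ai CCBot
```

The output still has to be indented to sit inside the YAML block scalar, so this only saves the typing, not the review.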

Once deployed:

❯ curl -sS -D - https://ping.readygo.run/ -o /dev/null
HTTP/2 200
date: Sun, 03 Mar 2024 19:46:51 GMT
content-type: application/json; charset=utf-8
content-length: 393
x-content-type-options: nosniff
x-robots-tag: noai
x-robots-tag: Google-Extended: none
x-robots-tag: GPTBot: none
x-robots-tag: ChatGPT-User: none
x-robots-tag: anthropic-ai: none
x-robots-tag: CCBot: none
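
To check this without eyeballing the output, something like the following could work. It's a sketch under a few assumptions, not a feature of curl or ingress-nginx: missing_robots_rules is a hypothetical helper that reads response headers on stdin and prints any expected rule it doesn't find, matching case-insensitively.

```shell
# Hypothetical helper: report which expected X-Robots-Tag rules are absent
# from a block of HTTP response headers read on stdin.
# Returns 0 if every rule is present, 1 otherwise.
missing_robots_rules() {
  # Read all headers, dropping the CR from each CRLF line ending.
  mrr_headers=$(tr -d '\r')
  mrr_status=0
  for mrr_rule in "$@"; do
    # -i: ignore case, -x: match the whole line, -F: treat the pattern literally.
    if ! printf '%s\n' "$mrr_headers" | grep -qixF -- "x-robots-tag: ${mrr_rule}"; then
      printf 'missing: %s\n' "$mrr_rule"
      mrr_status=1
    fi
  done
  return "$mrr_status"
}

# Usage (not run here, since it needs network access):
#   curl -sS -D - -o /dev/null https://ping.readygo.run/ \
#     | missing_robots_rules noai "GPTBot: none" "CCBot: none"
```

When every rule is present it prints nothing and exits 0, so it can be dropped into a cron job or CI check as-is.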

Smell ya later, 🤖s!


All configuration examples on this page are marked as CC0 [4].

[1] This qualifier is important: it's a request for the other party to respect the opt-out.

[2] Despite the naming, this option will overwrite (not append to) any existing header of the same name in the response.

[3] Thanks to the headers-more-nginx-module maintainers for their prompt response to my feature request ❤️

[4] I don't think they're copyrightable anyway, but I'm not a lawyer 🤷