How I protect my Forgejo instance from AI web crawlers

原始链接: https://her.esy.fun/posts/0031-how-i-protect-my-forgejo-instance-from-ai-web-crawlers/index.html

This article describes a lightweight solution for protecting a Forgejo instance (a code-hosting platform) from abusive web crawlers that overload the server by excessively requesting commit data. The author first blocked all access, then tried Anubis (a bot-protection system) but found it too heavy and complex. The solution implemented uses a simple Nginx configuration: it checks each request for a specific cookie, and if the cookie is absent, the user is redirected to a page that sets the cookie via JavaScript and then reloads. This effectively blocks crawlers that do not execute JavaScript while minimizing the impact on legitimate users, who only experience a single redirect. The author acknowledges the approach is easy to bypass, but argues that the sheer volume these crawlers operate at means adapting will take time. It is a "quick and dirty" fix that, at least for now, beats more complex solutions like Anubis. The author also notes that relying on JavaScript can be a limitation.

The Hacker News discussion centers on protecting Forgejo (a fork of Gitea) instances from AI web crawlers. The poster shared their defense approach, prompting a discussion of effective strategies. One key problem is crawlers repeatedly requesting repository zip downloads, generating large numbers of uncleaned files and consuming server resources; one proposed solution is to disable the zip-download feature entirely. Other commenters suggested IP-based blocking rules (via tools such as Tirreno) and making use of Cloudflare's "pay per crawl" feature. A recurring theme is that optimizing server performance often mitigates scraping problems, especially on self-hosted instances with limited traffic. Finally, some argued that intellectual-property concerns should be kept separate from performance concerns, suggesting that raw speed is a viable defense against most scraping attempts.

Original article

Put that in your nginx config:

location / {
  # Initialize the flag so nginx does not warn about an uninitialized variable
  set $bypass_cookie 0;
  # Needed to still allow git clone from http/https URLs
  if ($http_user_agent ~* "git/|git-lfs/") {
        set $bypass_cookie 1;
  }
  # If we see the expected cookie, we can also bypass the blocker page
  if ($cookie_Yogsototh_opens_the_door = "1") {
        set $bypass_cookie 1;
  }
  # Return the 418 blocker page if neither condition is met
  if ($bypass_cookie != 1) {
     add_header Content-Type text/html always;
     return 418 '<script>document.cookie = "Yogsototh_opens_the_door=1; Path=/;"; window.location.reload();</script>';
  }
  # rest of your nginx config for this location
}

Preferably, run a string replace to change Yogsototh_opens_the_door to your own personal cookie name.
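For example, with a hypothetical cookie name such as my_front_door (purely an illustration, not from the original post), the two lines that mention the cookie would become:

  # hypothetical cookie name, used here only as an illustration
  if ($cookie_my_front_door = "1") {
        set $bypass_cookie 1;
  }
  # ...and the blocker page must set that same cookie:
  return 418 '<script>document.cookie = "my_front_door=1; Path=/;"; window.location.reload();</script>';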

The main advantage is that it is almost invisible to the users of my website, compared to other solutions like Anubis.

Not so long ago I started to host my code on Forgejo. There is a promise that in the future it will support federation, and Forgejo is the same project that is used for Codeberg.

The only problem I had: one day, I discovered that my entire node was down. At first I didn't investigate and just restarted the node, but a few hours later it was down again. Looking into the cause, it was clearly the thousands of requests hitting every single commit that put too much pressure on the system. Who could be so interested in using the web API to look at every commit instead of, you know, cloning the repository locally and exploring it? Quickly, yep, like so many of you, I discovered that tons of crawlers that do not respect robots.txt were crawling my Forgejo instance until death ensues.

So I had no choice. At first I took a radical approach and blocked my website entirely, except for myself. But hey, why have a public forge if not for people to be able to look into it from time to time?

I then installed Anubis, but it wasn't really for me. It is way too heavy for my needs, and not as easy to configure and install as I would have hoped.

Then I saw the article "You don't need anubis" on lobste.rs, which uses a simple Caddy configuration to block these pesky crawlers. I made some adjustments to adapt it to nginx. For now, this works perfectly well: my users are just redirected once, without really noticing it, and they can use Forgejo just as before. And it keeps the crawlers away.

The strategy is pretty basic; in fact, a lot less advanced than the one adopted by Anubis. For every access to my website, I just check whether the request carries a specific cookie. If not, I serve a 418 HTML page containing a bit of JavaScript that sets this cookie and reloads the page.

That's it.

I also tried returning a 302 and setting the cookie from the response headers, without JavaScript, but the crawlers are immune to that second strategy (sketched below). Unfortunately this means my website can only be seen if you enable JavaScript in your browser. I feel this is acceptable. I guess that someday this very basic protection will not be enough, my Forgejo instance will break again, and I will be forced to use a more advanced system like Anubis or perhaps even iocaine.
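For reference, a minimal sketch of what such a JavaScript-free variant could look like in nginx, reusing the $bypass_cookie flag from the config above; this is an illustration of the idea, not necessarily the exact rule that was tested:

  # Sketch only: set the cookie via a response header and bounce the client
  # back to the same URL, with no JavaScript involved. As noted above, the
  # crawlers turned out to be immune to this approach.
  if ($bypass_cookie != 1) {
     add_header Set-Cookie "Yogsototh_opens_the_door=1; Path=/" always;
     return 302 $request_uri;
  }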

I hope this can be helpful, because I recently saw many discussions on this subject where people were not totally happy with Anubis, while, at least for me, this quick and dirty fix does the trick. I am fully aware that it would be very easy to bypass, but for now I think volume matters more than quality to these crawlers, and it may take a while before they need to adapt. Also, by publishing this, I know that if too many people use the same trick, these crawlers will quickly adapt.
