The Internet Is No Longer a Safe Haven

Original link: https://brainbaking.com/post/2025/10/the-internet-is-no-longer-a-safe-haven/

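In late October 2025, the site was hit by scraping bots, which put the server under heavy load and prompted the author to consider stronger defenses such as Anubis. The author regrets that this keeps happening and sees it as one more way AI scraping erodes the joy of software as a hobby, alongside its environmental impact and its effect on critical thinking. The attack overwhelmed the server with requests for Git commit logs, triggering Fail2ban and spiking CPU usage. Existing safeguards could not immediately stop scrapers that spoof a browser user agent, so the author temporarily blocked the attacking IP range (47.79.0.0/16), hosted by Alibaba in Singapore. The incident highlights the challenges of self-hosting and how hard it is becoming for hobbyists to maintain independent online spaces. The author rejects centralized solutions such as Cloudflare over privacy concerns, but acknowledges the need for stronger defenses and may move the Gitea instance elsewhere. Despite the frustration, the author remains committed to keeping the site running.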

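Hacker News discussion: The Internet Is No Longer a Safe Haven (brainbaking.com), 10 points, submitted by akyuu, 2 comments.

BinaryIgor: I wonder why these automated scrapers and attacks have increased over the last few years. Is it because better (open-source?) tooling is available? Or because hosting infrastructure has become cheaper for attackers too? Both? Something else? Perhaps the long-term solution to this kind of attack is to hide most of the internet behind some kind of proof-of-work system/network, so that only humans can access our sites, not machines.

trenchpilgrim (in reply): With AI you can write a simple scraper in a few minutes, and there is now market demand for cleaned and structured data.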

Original article

A couple of days ago, the small server hosting this website was temporarily knocked out by scraping bots. This wasn’t the first time, nor is it the first time I’m seriously considering employing more aggressive countermeasures such as Anubis (see for example the June 2025 summary post). But every time something like this happens, a portion of the software hobbyist in me dies. We should add this to the list of things AI scrapers destroy, next to our environment, the creative enthusiasm of the individuals who made the things that are being scraped, and our critical thinking skills.

When I tried accessing Brain Baking, I was met with an unusual delay that prompted me to log in and see what was going on. A simple top revealed both Gitea and the Fail2ban server gobbling up almost all CPU resources. Uh oh. Quickly killing Gitea didn’t reduce the work of Fail2ban, as the Nginx access logs were being flooded with entries such as:

47.79.216.157 - - [27/Oct/2025:13:05:34 +0100] "GET /wgroeneveld/brainbaking/src/commit/4359ae68930de084df09e1cfa05ffd4520fb7e40/content/links.md?display=source HTTP/1.1" 502 568 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36"
47.79.217.151 - - [27/Oct/2025:13:05:34 +0100] "GET /wgroeneveld/brainbaking/rss/commit/5911666cf0b30236cdc7590abb4e171534faf972/content/museum.md HTTP/1.1" 502 568 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36"
47.79.217.32 - - [27/Oct/2025:13:05:35 +0100] "GET /wgroeneveld/brainbaking/src/commit/7b46fd682f36af81d4852b8ee2ee9970c638cac6/layouts HTTP/1.1" 502 568 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36"
47.79.218.157 - - [27/Oct/2025:13:05:35 +0100] "GET /wgroeneveld/brainbaking/src/commit/4359ae68930de084df09e1cfa05ffd4520fb7e40/content/404.md HTTP/1.1" 502 568 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36"
47.79.216.205 - - [27/Oct/2025:13:05:35 +0100] "GET /wgroeneveld/brainbaking/src/commit/590574b17b0e1bb068d442d309341e98762fd55d/content/about.md HTTP/1.1" 502 568 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36"
47.79.217.95 - - [27/Oct/2025:13:05:35 +0100] "GET /wgroeneveld/brainbaking/rss/commit/25674d6de08a667926aab89362fa7bb585cd35c5/content/links.md HTTP/1.1" 502 568 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36"
47.79.218.191 - - [27/Oct/2025:13:05:35 +0100] "GET /wgroeneveld/brainbaking/src/commit/590574b17b0e1bb068d442d309341e98762fd55d/themes HTTP/1.1" 502 568 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36"
47.79.216.116 - - [27/Oct/2025:13:05:35 +0100] "GET /wgroeneveld/brainbaking/rss/commit/b4eac0fb71b056cb44fe062b8f2c0949dbb08af6/content/museum.md HTTP/1.1" 502 568 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36"

I have enough fail-safe systems in place to block bad bots, but the user agent Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36 isn’t immediately recognized as “bad”: it’s ridiculously easy to spoof that HTTP header. Most user agent checkers I throw this string at claim this agent isn’t a bot. That means we shouldn’t rely on this information alone.

Also, I temporarily block isolated IPs that keep on poking around (e.g. Nginx rate limiting that feeds offenders into the ban list), but of course these scrapers never come from a single source. Yet the base of the attacking IP ranges remained the same: 47.79. The website ipinfo.io can help in identifying the threat: AS45102 Alibaba (US) Technology Co., Ltd. Huh?
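To make the "same base range" observation concrete, here is a minimal Python sketch that tallies access-log entries per /16 network. It is an illustration under assumptions, not what was actually run on the server: the function name count_by_slash16 is made up, the first three sample lines are trimmed versions of the log excerpt above (user agent shortened to "UA"), and the 203.0.113.9 line is an invented entry from a documentation-only address range.

```python
import ipaddress
from collections import Counter


def count_by_slash16(log_lines):
    """Count requests per IPv4 /16 network.

    Assumes the client IP is the first space-separated field of each
    line, as in the Nginx access log excerpt above.
    """
    counts = Counter()
    for line in log_lines:
        first_field = line.split(" ", 1)[0]
        try:
            addr = ipaddress.ip_address(first_field)
        except ValueError:
            continue  # not an IP address: skip malformed lines
        if addr.version != 4:
            continue  # this sketch only groups IPv4 clients
        # strict=False masks the host bits, e.g. 47.79.216.157 -> 47.79.0.0/16
        network = ipaddress.ip_network(f"{addr}/16", strict=False)
        counts[str(network)] += 1
    return counts


# Trimmed sample lines; the last one is invented for contrast.
SAMPLE = [
    '47.79.216.157 - - [27/Oct/2025:13:05:34 +0100] "GET /wgroeneveld/brainbaking/src/commit/4359ae68930de084df09e1cfa05ffd4520fb7e40/content/links.md?display=source HTTP/1.1" 502 568 "-" "UA"',
    '47.79.217.151 - - [27/Oct/2025:13:05:34 +0100] "GET /wgroeneveld/brainbaking/rss/commit/5911666cf0b30236cdc7590abb4e171534faf972/content/museum.md HTTP/1.1" 502 568 "-" "UA"',
    '47.79.218.157 - - [27/Oct/2025:13:05:35 +0100] "GET /wgroeneveld/brainbaking/src/commit/4359ae68930de084df09e1cfa05ffd4520fb7e40/content/404.md HTTP/1.1" 502 568 "-" "UA"',
    '203.0.113.9 - - [27/Oct/2025:13:05:36 +0100] "GET / HTTP/1.1" 200 1024 "-" "UA"',
]

if __name__ == "__main__":
    for network, hits in count_by_slash16(SAMPLE).most_common():
        print(network, hits)
```

Grouping by /16 is a blunt choice: it matches the range that ended up being blocked here, but a different attack might call for a narrower or wider prefix.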

Apparently, Alibaba provides hosting from Singapore that is frequently being abused by attackers. Many others who host forum software such as phpBB experienced the same problems, and although AbuseIPDB doesn’t report recent issues on the IPs from the above logs, I went ahead and blocked the entire range.

Fail2ban was struggling to keep up: it ingests the Nginx access.log file to apply its rules but if the files keep on exploding… Piping cat access.log | grep /commit/ | cut -d " " -f 1 to instant-ban everyone trying to access Git’s commit logs simply wasn’t fast enough. The only thing that had immediate effect was sudo iptables -I INPUT -s 47.79.0.0/16 -j DROP.
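For readers who prefer to see the filtering step spelled out, the following Python sketch mirrors the grep/cut pipeline and then checks whether each address falls inside the blocked range. It is a rough equivalent under assumptions, not the author's tooling: the names LOG_RE, commit_request_ips, and covered_by are hypothetical, the regular expression assumes the log layout shown in the excerpt above, and unlike grep it only looks for /commit/ in the request path rather than anywhere on the line. The sample lines are trimmed from the excerpt, except the 203.0.113.9 entry, which is invented.

```python
import ipaddress
import re

# Client IP, then the request path inside the quoted request line.
LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "\S+ (\S+)[^"]*"')


def commit_request_ips(log_lines):
    """Return the sorted, de-duplicated client IPs that requested a /commit/ path."""
    ips = set()
    for line in log_lines:
        match = LOG_RE.match(line)
        if match and "/commit/" in match.group(2):
            ips.add(match.group(1))
    return sorted(ips)


def covered_by(ip, cidr="47.79.0.0/16"):
    """True if ip lies inside cidr (default: the range blocked in the post)."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr)


SAMPLE = [
    '47.79.216.157 - - [27/Oct/2025:13:05:34 +0100] "GET /wgroeneveld/brainbaking/src/commit/4359ae68930de084df09e1cfa05ffd4520fb7e40/content/links.md?display=source HTTP/1.1" 502 568 "-" "UA"',
    '47.79.217.32 - - [27/Oct/2025:13:05:35 +0100] "GET /wgroeneveld/brainbaking/src/commit/7b46fd682f36af81d4852b8ee2ee9970c638cac6/layouts HTTP/1.1" 502 568 "-" "UA"',
    '203.0.113.9 - - [27/Oct/2025:13:05:36 +0100] "GET /index.xml HTTP/1.1" 200 1024 "-" "UA"',
]

if __name__ == "__main__":
    for ip in commit_request_ips(SAMPLE):
        print(ip, "inside blocked range" if covered_by(ip) else "outside blocked range")
```

A script like this has the same weakness as the shell pipeline: it rereads a log that keeps growing, so it does not replace a firewall rule on the whole range when the flood is already under way.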

In case that wasn’t yet clear: I hate having to deal with this. It’s a waste of time, doesn’t hold back the next attack coming from another range, and intervening always happens too late. But worst of all, semi-random fire fighting is just one big mood killer. I just know this won’t be enough. Having a robust anti-attacker system in place might increase the odds, but that means either resorting to hand cannons like Anubis or moving the entire hosting to Cloudflare, which will do it for me. But I don’t want to fiddle with even more moving components and configuration, nor do I want to route my visitors through tracking-enabled USA servers.

That Gitea instance should be moved off-site, or better yet, I should move the migration to Codeberg to the top of my TODO list. Yet it’s sad to see that people who like fiddling with their own little servers are increasingly punished for doing so, pushing many to a centralized solution, making things worse in the long term. The internet is no longer a safe haven for software hobbyists. I could link to dozens of other bloggers who reported similar issues to further solidify my point.

Another thing I’ve noticed is increased traffic with Referer headers coming from strange websites such as bioware.com, mcdonalds.com, and microsoft.com. It’s not like any of these giants are going to link to an article on this site. I don’t understand what the purpose of spoofing that header is besides upping the hit count.
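One way to spot this kind of Referer noise is to tally the referring hosts in the access log. The Python sketch below is only an illustration: the names TAIL_RE and referer_hosts are hypothetical, the pattern assumes the same log layout as the excerpt earlier in this post, and every sample line is invented (the real excerpt only shows "-" as the Referer).

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Tail of a log line: "request" status size "referer" "user agent"
TAIL_RE = re.compile(r'"[^"]*" \d{3} \S+ "([^"]*)" "[^"]*"$')


def referer_hosts(log_lines):
    """Count requests per Referer host, ignoring empty and "-" referers."""
    counts = Counter()
    for line in log_lines:
        match = TAIL_RE.search(line.rstrip())
        if not match:
            continue
        referer = match.group(1)
        if referer in ("", "-"):
            continue
        host = urlparse(referer).hostname  # None when no scheme/host is present
        if host:
            counts[host] += 1
    return counts


# Invented sample lines, using documentation-only client addresses.
SAMPLE = [
    '203.0.113.5 - - [27/Oct/2025:14:00:00 +0100] "GET /post/ HTTP/1.1" 200 512 "https://bioware.com/" "UA"',
    '203.0.113.6 - - [27/Oct/2025:14:00:01 +0100] "GET /post/ HTTP/1.1" 200 512 "https://bioware.com/news" "UA"',
    '203.0.113.7 - - [27/Oct/2025:14:00:02 +0100] "GET /post/ HTTP/1.1" 200 512 "https://www.microsoft.com/" "UA"',
    '203.0.113.8 - - [27/Oct/2025:14:00:03 +0100] "GET /post/ HTTP/1.1" 200 512 "-" "UA"',
]

if __name__ == "__main__":
    for host, hits in referer_hosts(SAMPLE).most_common():
        print(host, hits)
```

Because the header is as easy to fake as the user agent, such a tally shows what clients claim, not where the traffic really came from.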

However worse things might get, I refuse to give in.

It’s just like 50 Cent said: Get Hostin’ Or Die Tryin’.

webdesign   scraping  AI 
