Thank you, AI
End of an era for me: no more self-hosted git

Original link: https://www.kraxel.org/blog/2026/01/thank-you-ai/

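After running a public self-hosted Git server since 2011 (and a CVS server before that), the author has decided to shut it down because of relentless abuse by AI scrapers. The scrapers flooded the server, and the cgit web frontend in particular, with inefficient requests until it broke down, and the author decided not to rebuild it. Frustrating as this is, the author chose not to fight an ongoing battle with the scrapers. Fortunately, most repositories already had mirrors on platforms such as GitLab and GitHub, which are now the primary home of the code, and links have been updated to point there. The author still runs one self-hosted web server for a static blog (built with Jekyll), expecting static pages to hold up better against scraping. Even that server had one brief outage, though: the 404 responses triggered by the scrapers filled the disk with log data, which required a configuration fix.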

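A recent Hacker News thread on the post sparked discussion about the surge in web scraping, which some attribute to AI. The author describes being overwhelmed by these scrapers, which also produced a flood of 404 errors on the site. Commenters questioned whether AI is really responsible, arguing that badly written traditional scrapers, or more sophisticated attacks that bypass standard bot detection, are the more likely culprits. There was debate about why this issue makes headlines, along with concern that the internet is turning into a "digital desert" because of such activity. Solutions such as Cloudflare were suggested, but many users reported limited success even *with* such services, because scraping techniques keep getting more sophisticated and may be financially motivated. Some criticized Cloudflare's attempt to monetize protection against these scrapers as a "scam". A key point was the lack of publicly available logs that would allow scraper behavior to be analyzed and its origin confirmed.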

Original text

Ok, it is over. End of an era for me. No more self-hosted git. I had a public git server running since 2011, and a public CVS server before that. AI scrapers have hammered the poor, little server to death by flooding the cgit frontend with tons of pointless² requests. That actually happened a few months ago already.

Now I have finally decided not to try to rebuild the server, be it with or without the cgit web frontend. I don't feel like taking up the fight with the scrapers in my spare time; I leave that to people who are in a better position to do so. Most repositories had mirrors on one or two of the large git forges already. Those are the primary repositories now. Go look at GitLab and GitHub.

Last week I fixed all (I hope) dangling links to the cgit repositories to point to the forges instead.

Now I'm down to one self-hosted service, which is the webserver hosting mainly this blog and a few more little things. In 2018 I migrated the blog from WordPress to Jekyll, so it is all static pages. Taking this out by AI scrapers overloading the machine should be next to impossible, and so far this has held up.

Nevertheless, AI scrapers already managed to trigger one outage. Apparently millions of 404 answers were not enough to convince the bots that there is no cgit service (any more). Apache had no problems delivering those, but the logs filled up the disk so fast that logrotate didn't manage to keep things under control with the default configuration. Fixed the config. Knock on wood.
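To get a feel for how much of an access log such 404 traffic accounts for, something like the following could be run over the log. This is a minimal sketch, not the tooling used here: it assumes Apache's "combined" log format, and the `/cgit/` path prefix, the function name `summarize_404s` and the log path in the usage note are hypothetical placeholders.

```python
import re
from collections import Counter

# Assumed layout (Apache "combined" format):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def summarize_404s(lines, path_prefix="/cgit/"):
    """Count 404 responses under path_prefix per user agent.

    Also returns the number of log bytes those entries occupy.
    path_prefix is a hypothetical default, not taken from the post.
    """
    per_agent = Counter()
    log_bytes = 0
    for line in lines:
        line = line.rstrip("\n")
        match = LOG_LINE.match(line)
        if match is None:
            continue  # not in the assumed format (e.g. escaped quotes); skip
        parts = match.group("request").split()
        path = parts[1] if len(parts) >= 2 else ""
        if match.group("status") == "404" and path.startswith(path_prefix):
            per_agent[match.group("agent")] += 1
            log_bytes += len(line.encode("utf-8")) + 1  # +1 for the newline
    return per_agent, log_bytes
```

Usage would be along the lines of `with open("/var/log/httpd/access_log") as f: agents, size = summarize_404s(f)` followed by `agents.most_common(10)`, where the log path is again an assumption that depends on the distribution.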


¹ Title inspired by the 2025 edition of Security Nightmares. Fun to watch if you speak German.
² Most inefficient way to get the complete repo. Just clone it, ok?
