Please stop externalizing your costs directly into my face

Original link: https://drewdevault.com/2025/03/17/2025-03-17-Stop-externalizing-your-costs-on-me.html

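SourceHut's resources are being relentlessly abused by large language model (LLM) crawlers, causing frequent outages and delaying critical development work. This is not an isolated case: earlier problems include cryptocurrency-mining abuse (which led to paid CI) and a denial of service caused by the Go module mirror's excessive git clones. The LLM crawlers ignore robots.txt, hit expensive endpoints, and mimic legitimate user traffic, which makes mitigation extremely difficult. The author and other sysadmins are constantly fighting these bots, at the expense of their core work and of the user experience. The author condemns pushing costs onto small platforms and demands that those profiting from LLMs contribute to the common good rather than cause harm. He urges people to stop legitimizing and using LLMs and related technologies, citing their environmental, social, and ethical problems, and states that he will not work with anyone involved in developing them.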

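The drewdevault.com article "Please stop externalizing your costs directly into my face" sparked a Hacker News discussion about data scraping by LLM companies. Drew expressed frustration at LLM crawlers scraping his and other people's websites, which consumes site resources, and called on people to stop legitimizing LLMs and AI image generators. Commenters discussed potential solutions, such as Cloudflare or DDoS protection, while acknowledging the broader impact on the open web. There were concerns about LLM companies using residential proxies to disguise scraping traffic, and about data being used without consent. Some users argued that LLM scraping resembles a DDoS attack, highlighting the cost burden it places on site owners. The discussion also touched on the internet moving toward requiring complex infrastructure to counter such problems, which could harm small sites and open source projects.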

Original text

Over the past few months, instead of working on our priorities at SourceHut, I have spent anywhere from 20-100% of my time in any given week mitigating hyper-aggressive LLM crawlers at scale. This isn’t the first time SourceHut has been at the wrong end of some malicious bullshit or paid someone else’s externalized costs – every couple of years someone invents a new way of ruining my day.

Four years ago, we decided to require payment to use our CI services because it was being abused to mine cryptocurrency. We alternated between periods of designing and deploying tools to curb this abuse and periods of near-complete outage when they adapted to our mitigations and saturated all of our compute with miners seeking a profit. It was bad enough having to beg my friends and family to avoid “investing” in the scam without having the scam break into my business and trash the place every day.

Two years ago, we threatened to blacklist the Go module mirror because for some reason the Go team thinks that running terabytes of git clones all day, every day for every Go project on git.sr.ht is cheaper than maintaining any state or using webhooks or coordinating the work between instances or even just designing a module system that doesn’t require Google to DoS git forges whose entire annual budgets are considerably smaller than a single Google engineer’s salary.

Now it’s LLMs. If you think these crawlers respect robots.txt then you are several assumptions of good faith removed from reality. These bots crawl everything they can find, robots.txt be damned, including expensive endpoints like git blame, every page of every git log, and every commit in every repo, and they do so using random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.
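For contrast, here is a minimal sketch of the check a well-behaved crawler would run before every fetch, using Python's standard `urllib.robotparser`. The robots.txt contents, the `git.example.org` host, the `ExampleBot` name, and the `allowed` helper are all hypothetical and chosen for illustration; they are not SourceHut's actual rules or paths.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that asks every crawler to stay away from
# expensive endpoints. Illustrative only, not SourceHut's real file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /blame/
Disallow: /log/
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://git.example.org/repo/tree",
        "https://git.example.org/blame/main/README.md",
    ):
        print(url, allowed("ExampleBot", url))
```

The check is purely voluntary: nothing on the server enforces it, which is why a crawler that skips it can only be stopped by other means.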

We are experiencing dozens of brief outages per week, and I have to review our mitigations several times per day to keep that number from getting any higher. When I do have time to work on something else, often I have to drop it when all of our alarms go off because our current set of mitigations stopped working. Several high-priority tasks at SourceHut have been delayed weeks or even months because we keep being interrupted to deal with these bots, and many users have been negatively affected because our mitigations can’t always reliably distinguish users from bots.

All of my sysadmin friends are dealing with the same problems. I was asking one of them for feedback on a draft of this article and our discussion was interrupted to go deal with a new wave of LLM bots on their own server. Every time I sit down for beers or dinner or to socialize with my sysadmin friends it’s not long before we’re complaining about the bots and asking if the other has cracked the code to getting rid of them once and for all. The desperation in these conversations is palpable.

Whether it’s cryptocurrency scammers mining with FOSS compute resources or Google engineers too lazy to design their software properly or Silicon Valley ripping off all the data they can get their hands on at everyone else’s expense… I am sick and tired of having all of these costs externalized directly into my fucking face. Do something productive for society or get the hell away from my servers. Put all of those billions and billions of dollars towards the common good before sysadmins collectively start a revolution to do it for you.

Please stop legitimizing LLMs or AI image generators or GitHub Copilot or any of this garbage. I am begging you to stop using them, stop talking about them, stop making new ones, just stop. If blasting CO2 into the air and ruining all of our freshwater and traumatizing cheap laborers and making every sysadmin you know miserable and ripping off code and books and art at scale and ruining our fucking democracy isn’t enough for you to leave this shit alone, what is?

If you personally work on developing LLMs et al, know this: I will never work with you again, and I will remember which side you picked when the bubble bursts.
