AI crawlers haven't learned to play nice with websites

Original link: https://www.theregister.com/2025/03/18/ai_crawlers_sourcehut/

SourceHut (a git hosting service) and other open source projects are suffering service disruptions caused by aggressive web crawlers used to train AI models. These crawlers demand so much bandwidth that the effect resembles a denial-of-service attack. SourceHut has blocked cloud providers such as Google Cloud and Azure to mitigate the problem, which may affect some users.

Although AI providers such as OpenAI have promised to respect robots.txt files, reports of abuse persist. The repair site iFixit and the cloud hosting service Vercel have also reported problems with AI crawlers. One developer noticed a surge in LLM training bot traffic that stopped once the issue drew attention; since then, some malicious users have spoofed OpenAI's user-agent string, making log analysis difficult.

General invalid traffic (GIVT) has risen sharply because of AI crawlers, with a large share attributed to AI scrapers. Some bots declare their purpose, while others serve mixed purposes, which complicates blocking strategies. Google has implemented a "Google-Extended" token that lets sites prevent their content from being used for AI training while preserving their search visibility.

A Hacker News thread discussed an article about AI crawlers failing to respect website etiquette. Commenters voiced concern that AI companies, particularly well-resourced ones, prioritize data acquisition over ethical considerations such as copyright and site stability. One commenter noted that these crawlers were able to stop once their behavior drew public attention, suggesting they deliberately ignore the rules until consequences loom. Another hoped data poisoning would see wider adoption as a defense against such aggressive scraping. Overall, the sentiment was that AI crawlers knowingly disregard established norms and only deal with the consequences when forced to, prompting calls for countermeasures.

Original text

SourceHut, an open source git-hosting service, says web crawlers for AI companies are slowing down services through their excessive demands for data.

"SourceHut continues to face disruptions due to aggressive LLM crawlers," the biz reported Monday on its status page. "We are continuously working to deploy mitigations. We have deployed a number of mitigations which are keeping the problem contained for now. However, some of our mitigations may impact end-users."

SourceHut said it had deployed Nepenthes, a tar pit to catch web crawlers that scrape data primarily for training large language models, and noted that doing so might degrade access to some web pages for users.
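Nepenthes itself is a separate project, but the tar-pit idea is simple: answer a crawler with an endless, slowly dripped stream of generated pages full of links back into the trap, so the bot wastes its time at little cost to the host. A minimal sketch of that idea in Python (using Flask, with made-up paths; this is an illustration, not Nepenthes):

    # tarpit_sketch.py -- an illustration of the tar-pit idea, not Nepenthes itself
    import random
    import string
    import time

    from flask import Flask, Response

    app = Flask(__name__)

    def endless_pages():
        # Drip fake HTML fragments forever, each linking deeper into the pit.
        while True:
            token = "".join(random.choices(string.ascii_lowercase, k=8))
            yield f'<p>{token}</p><a href="/pit/{token}">{token}</a>\n'
            time.sleep(2)  # slow drip keeps the crawler's connection tied up

    @app.route("/pit/", defaults={"path": ""})
    @app.route("/pit/<path:path>")
    def pit(path):
        # Stream the generator so the response never finishes on its own.
        return Response(endless_pages(), mimetype="text/html")

    if __name__ == "__main__":
        app.run()

A real deployment would serve this asynchronously and route only suspected bots into the pit, since a human who wanders in gets the same treatment, which is part of the degraded-access trade-off SourceHut warns about.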

"We have unilaterally blocked several cloud providers, including GCP [Google Cloud] and [Microsoft] Azure, for the high volumes of bot traffic originating from their networks," the biz said, advising administrators of services that integrate with SourceHut to get in touch to arrange an exception to the blocking.

This is not the first time SourceHut has borne the bandwidth burden of serving unrestrained web requests. The outfit raised similar objections to Google's Go Module Mirror in 2022, likening the traffic overload to a denial-of-service attack. And other open source projects such as GMP have also faced this problem.

But AI crawlers have been particularly ill-behaved over the past two years as the generative AI boom has played out. OpenAI in August 2023 made it known its web crawlers would respect robots.txt files, a set of directives served by websites to tell crawlers whether they're welcome. Other AI providers made similar commitments.
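The file itself is nothing more than plain text served at the site root, naming each crawler's user agent and the paths it should avoid. A minimal example (the disallowed path is a placeholder):

    # https://example.com/robots.txt
    User-agent: GPTBot      # OpenAI's declared crawler
    Disallow: /             # keep out entirely

    User-agent: *
    Disallow: /private/     # all other crawlers: skip this path

Compliance is entirely voluntary, which is why both the commitments and the reports of abuse below matter.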

Nonetheless, reports of abuse continue. Repair website iFixit raised the issue last July when Anthropic's Claudebot was accused of excessive crawling.

In December 2024, cloud hosting service Vercel said AI crawlers have become a significant presence. In the preceding month, the biz said, OpenAI's GPTBot generated 569 million requests on its network while Anthropic's Claude accounted for 370 million. Together, that's roughly 939 million requests, about 20 percent of the 4.5 billion requests from Googlebot, used for Google's search indexing, during the same period.

Later that month, Diaspora developer Dennis Schubert also noted a surge in AI bots. In a post, he said that 70 percent of the traffic to his server in the previous 60 days came from LLM training bots.

The Register asked Schubert about this in early January. "Funnily enough, a few days after the post went viral, all crawling stopped," he responded at the time. "Not just on the Diaspora wiki, but on my entire infrastructure. I'm not entirely sure why, but here we are."

The problem didn't entirely go away, he said, because the visibility of his post inspired internet trolls to create their own wiki crawlers that now masquerade as OpenAI's GPTBot.

The upshot is that log analysis has become more difficult.

... it's just a**holes trying to be funny

"For example, I placed a 'canary' into the robots.txt now, and that now has reached almost a million hits, including hits with the GPTBot user agent string," explained Schubert. "The problem is just that those requests are absolutely not from OpenAI. OpenAI appears to be using Microsoft Azure for their crawlers. But all those canary hits came from AWS IPs and even some US residential ISPs. So it's just assholes trying to be funny spoofing their [user-agent] string."

Meanwhile, reports of ill-behaved AI crawlers continue, as do efforts to thwart them. User-agent spoofing has also come up in response to claims that Amazon's Amazonbot has been overloading a developer's server.

According to DoubleVerify, an ad metrics firm, general invalid traffic – aka GIVT, bots that should not be counted as ad views – rose by 86 percent in the second half of 2024 due to AI crawlers.

The firm said, "a record 16 percent of GIVT from known-bot impressions in 2024 were generated by those that are associated with AI scrapers, such as GPTBot, ClaudeBot and AppleBot."

The ad biz also observed that while some bots, such as the Meta AI bot and AppleBot, declare they're out to gather data for training AI, other crawlers serve a mix of purposes, which makes blocking more complicated. For example, disallowing visits from GoogleBot, which scours the web for both search and AI, could hinder the site's search visibility.

To avoid that, Google in 2023 implemented a robots.txt token called Google-Extended that sites can use to prevent their web content from being used for training the internet titan's Gemini and Vertex AI services while still allowing those sites to be indexed for search.
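In robots.txt terms, the opt-out is one extra stanza; search crawling is untouched because Googlebot is addressed separately. A sketch:

    User-agent: Google-Extended   # controls use of content for Gemini / Vertex AI training
    Disallow: /

    User-agent: Googlebot         # search indexing continues as before
    Allow: /

As the spoofing reports above show, though, a robots.txt rule only binds crawlers that choose to honor it. ®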
