SourceHut, an open source git-hosting service, says web crawlers for AI companies are slowing down services through their excessive demands for data.
"SourceHut continues to face disruptions due to aggressive LLM crawlers," the biz reported Monday on its status page. "We are continuously working to deploy mitigations. We have deployed a number of mitigations which are keeping the problem contained for now. However, some of our mitigations may impact end-users."
SourceHut said it had deployed Nepenthes, a tar pit to catch web crawlers that scrape data primarily for training large language models, and noted that doing so might degrade access to some web pages for users.
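Tar pits of this kind work by serving crawlers an endless maze of slowly generated pages stuffed with links to more of the same, so a bot that ignores robots.txt ends up wasting its own time and bandwidth. As a rough illustration only – a hypothetical sketch of the idea, not Nepenthes' actual code – the technique fits in a few lines of Python:

```python
# Hypothetical tar-pit sketch: stream an endless, slow maze of links to
# misbehaving crawlers. Illustrative only; not how Nepenthes is implemented.
import random
import string
import time

from flask import Flask, Response

app = Flask(__name__)

def random_slug(length: int = 8) -> str:
    # Random path segment so every generated link is unique
    return "".join(random.choices(string.ascii_lowercase, k=length))

@app.route("/maze/<slug>")
def maze(slug):
    def drip():
        yield "<html><body>\n"
        for _ in range(50):
            time.sleep(2)  # drip-feed the page to tie the crawler up
            yield f'<p><a href="/maze/{random_slug()}">{random_slug()}</a></p>\n'
        yield "</body></html>\n"
    # A streaming response keeps the bot's connection open for minutes,
    # rewarding it only with more links deeper into the maze
    return Response(drip(), mimetype="text/html")

if __name__ == "__main__":
    app.run()
```

The trade-off SourceHut alludes to is that anything designed to slow down or misdirect bots risks doing the same to legitimate visitors who stumble into the trap.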
"We have unilaterally blocked several cloud providers, including GCP [Google Cloud] and [Microsoft] Azure, for the high volumes of bot traffic originating from their networks," the biz said, advising administrators of services that integrate with SourceHut to get in touch to arrange an exception to the blocking.
This is not the first time SourceHut has borne the bandwidth burden of serving unrestrained web requests. The outfit raised similar objections to Google's Go Module Mirror in 2022, likening the traffic overload to a denial-of-service attack. And other open source projects, such as GMP, have faced the same problem.
But AI crawlers have been particularly ill-behaved over the past two years as the generative AI boom has played out. OpenAI in August 2023 made it known its web crawlers would respect robots.txt files, a set of directives served by websites to tell crawlers whether they're welcome. Other AI providers made similar commitments.
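On paper, opting out takes only a few lines. The snippet below is an illustrative example using the vendors' publicly documented crawler names, though honoring it remains entirely voluntary on the bot operator's side:

```
# Ask AI training crawlers to stay away (bot names per their public docs)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Everyone else is still welcome
User-agent: *
Allow: /
```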
Nonetheless, reports of abuse continue. Repair website iFixit raised the issue last July when Anthropic's Claudebot was accused of excessive crawling.
In December 2024, cloud hosting service Vercel said AI crawlers had become a significant presence. Over the preceding month, the biz said, OpenAI's GPTBot generated 569 million requests on its network while Anthropic's Claude accounted for 370 million. Together, the two amounted to about 20 percent of the 4.5 billion requests made by Googlebot, which handles Google's search indexing, over the same period.
Later that month, Diaspora developer Dennis Schubert also noted a surge in AI bots. In a post, he said that 70 percent of the traffic to his server in the previous 60 days came from LLM training bots.
The Register asked Schubert about this in early January. "Funnily enough, a few days after the post went viral, all crawling stopped," he responded at the time. "Not just on the Diaspora wiki, but on my entire infrastructure. I'm not entirely sure why, but here we are."
The problem didn't entirely go away, he said, because the visibility of his post inspired internet trolls to create their own wiki crawlers that now masquerade as OpenAI's GPTBot.
The upshot is that log analysis has become more difficult.
... it's just a**holes trying to be funny
"For example, I placed a 'canary' into the robots.txt now, and that now has reached almost a million hits, including hits with the GPTBot user agent string," explained Schubert. "The problem is just that those requests are absolutely not from OpenAI. OpenAI appears to be using Microsoft Azure for their crawlers. But all those canary hits came from AWS IPs and even some US residential ISPs. So it's just assholes trying to be funny spoofing their [user-agent] string."
Meanwhile, reports of ill-behaved AI crawlers continue, as do efforts to thwart them. User-agent spoofing has also been raised in response to claims that Amazon's Amazonbot has been overloading a developer's server.
According to DoubleVerify, an ad metrics firm, general invalid traffic – aka GIVT, bots that should not be counted as ad views – rose by 86 percent in the second half of 2024 due to AI crawlers.
The firm said, "a record 16 percent of GIVT from known-bot impressions in 2024 were generated by those that are associated with AI scrapers, such as GPTBot, ClaudeBot and AppleBot."
The ad biz also observed that while some bots, such as the Meta AI bot and AppleBot, declare they're out to gather data for training AI, other crawlers serve a mix of purposes, which makes blocking more complicated. Disallowing visits from Googlebot, for example, which scours the web for both search and AI, could hinder a site's search visibility.
To avoid that, Google in 2023 implemented a robots.txt token called Google-Extended that sites can use to prevent their web content from being used for training the internet titan's Gemini and Vertex AI services while still allowing those sites to be indexed for search.
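In robots.txt terms, that opt-out is a single stanza; the snippet below is an illustrative example based on Google's published documentation for the token:

```
# Keep content out of Gemini/Vertex AI training while leaving ordinary
# Googlebot search crawling untouched
User-agent: Google-Extended
Disallow: /
```

As with every robots.txt directive, though, it is a request rather than an enforcement mechanism – which is exactly the gap outfits like SourceHut are now trying to plug with tar pits and network blocks. ®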