Bots are overwhelming websites with their hunger for AI data

Original link: https://www.theregister.com/2025/06/17/bot_overwhelming_websites_report/

The latest report from the GLAM-E Lab shows that galleries, libraries, archives, and museums (GLAM institutions) are being hit hard by AI bots scraping their content at scale to train AI models. This aggressive data collection is consuming their resources and, in some cases, taking services offline. A survey of 43 institutions found that most have experienced traffic surges caused by AI training bots, and that traditional robots.txt directives have proven ineffective. Defenses from providers such as AWS and Cloudflare offer some protection but are not fully effective. The report echoes similar concerns raised by open access repositories, the Wikimedia Foundation, and other online platforms. The GLAM-E Lab argues that AI companies need to adopt more responsible approaches to data access, because GLAM institutions cannot indefinitely bear the rising cost of keeping their collections online while fending off bot swarms. A sustainable approach to data access is needed to preserve access to cultural heritage.

A Hacker News discussion highlights the growing problem of AI data-hungry bots overwhelming websites. Site owners are struggling with rising server load and slow page loads caused by relentless scraping, even after deploying rate-limiting measures. Many bots ignore robots.txt directives, rendering traditional blocking methods ineffective. The thread explores possible countermeasures, including caching content for logged-out users, proof-of-work challenges, and community-maintained lists of AI crawler IP ranges. Some suggest that paywalls and private services may become necessary to protect content. Commenters worry about the impact on accessibility, since locking content behind logins could hinder search engine indexing and overall web visibility. The discussion also asks who should be held responsible, whether the traffic comes mainly from companies or from individuals building personal LLMs, and whether criminal penalties could be part of the answer. Ultimately, a growing consensus holds that the web is shifting toward a more restricted, monetization-driven internet, likely to the detriment of open sharing.
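The rate limiting the thread describes can be sketched as a per-client token bucket. This is an illustrative sketch, not any specific site's implementation; the class name and parameters are assumptions chosen for the example:

```python
import time


class TokenBucket:
    """Per-client token-bucket rate limiter (illustrative sketch).

    Each client may burst up to `capacity` requests, then is throttled
    to an average of `rate` requests per second.
    """

    def __init__(self, rate=5.0, capacity=10.0):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = {}          # client -> remaining tokens
        self.last = {}            # client -> time of last request

    def allow(self, client, now=None):
        """Return True if this request fits the client's budget."""
        now = time.monotonic() if now is None else now
        last = self.last.get(client, now)
        tokens = self.tokens.get(client, self.capacity)
        # Refill tokens for the time elapsed since the last request,
        # capped at the bucket's capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        self.last[client] = now
        if tokens >= 1.0:
            self.tokens[client] = tokens - 1.0
            return True
        self.tokens[client] = tokens
        return False
```

As the discussion notes, though, scrapers that rotate IP addresses or ignore polite signals defeat per-client limits like this, which is why commenters also raise proof-of-work challenges and shared blocklists.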

Original article

Bots harvesting content for AI companies have proliferated to the point that they're threatening digital collections of arts and culture.

Galleries, Libraries, Archives, and Museums (GLAMs) say they're being overwhelmed by AI bots – web crawling scripts that visit websites and download data to be used for training AI models – according to a report issued on Tuesday by the GLAM-E Lab, which studies issues affecting GLAMs.

GLAM-E Lab is a joint initiative between the Centre for Science, Culture and the Law at the University of Exeter and the Engelberg Center on Innovation Law & Policy at NYU Law.

Based on an anonymized survey of 43 organizations, the report indicates that cultural institutions are alarmed by the aggressive harvesting of their content, which shows no regard for the burden that data-harvesting places on websites.

"Bots are widespread, although not universal," the report says. "Of 43 respondents, 39 had experienced a recent increase in traffic. Twenty-seven of the 39 respondents experiencing an increase in traffic attributed it to AI training data bots, with an additional seven believing that bots could be contributing to the traffic."

The surge in bots that gather data for AI training, the report says, often went unnoticed until it became so bad that it knocked online collections offline.

"Respondents worry that swarms of AI training data bots will create an environment of unsustainably escalating costs for providing online access to collections," the report says.

The institutions commenting on these concerns have differing views about when the bot surge began. Some report noticing it as far back as 2021, while others only began noticing web scraper traffic this year.

Some of the bots identify themselves, but some don't. Either way, the respondents say that robots.txt directives – voluntary behavior guidelines that web publishers post for web crawlers – are not currently effective at controlling bot swarms.
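The voluntary directives in question look like the robots.txt fragment below. The user-agent strings shown (GPTBot for OpenAI, CCBot for Common Crawl) are the names those operators publish for their crawlers, but compliance is entirely at the bot's discretion:

```text
# robots.txt is a request, not an enforcement mechanism
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Everything else (e.g. search engine indexers) remains welcome
User-agent: *
Allow: /
```

As the survey respondents report, many scrapers simply ignore these rules, which is why institutions fall back on firewalls and commercial bot defenses.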

Bot defenses offered by the likes of AWS and Cloudflare do appear to help, but GLAM-E Lab acknowledges that the problem is complex. Placing content behind a login may not be effective if an institution's goal is to provide public access to digital assets. And there may be a reason to want some degree of bot traffic, such as bots that index sites for search engines.

The GLAM-E Lab survey echoes the findings of a similar report issued earlier this month by the Confederation of Open Access Repositories (COAR) based on the responses of 66 open access repositories run by libraries, universities, and other institutions.

The COAR report says: "Over 90 percent of survey respondents indicated their repository is encountering aggressive bots, usually more than once a week, and often leading to slowdowns and service outages. While there is no way to be 100 percent certain of the purpose of these bots, the assumption in the community is that they are AI bots gathering data for generative AI training."

The GLAM-E Lab survey also recalls complaints about abusive bots raised by The Wikimedia Foundation, Sourcehut, Diaspora developer Dennis Schubert, repair site iFixit, and documentation project ReadTheDocs.

Ultimately, the GLAM-E report argues that AI providers need to develop more responsible ways to interact with other websites.

"The cultural institutions that host online collections are not resourced to continue adding more servers, deploying more sophisticated firewalls, and hiring more operations engineers in perpetuity," the report says. "That means it is in the long-term interest of the entities swarming them with bots to find a sustainable way to access the data they are so hungry for." ®
