I fear for the unauthenticated web

Original link: https://sethmlarson.dev/i-fear-for-the-unauthenticated-web

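The author worries about the growing trend of LLM and AI companies scraping web content without authorization, which could harm small websites and services. Scraping initially targeted large websites with strong protections, but it has now spread to small project forges such as the GNOME GitLab server. The author fears this will soon affect platforms like Mastodon and even individual websites, which may force widespread use of authentication or JavaScript challenges. The author criticizes these companies for sacrificing the health of the open web for "shitty chat bots" and warns that scraping abuse can produce unexpected cloud infrastructure bills, since the abusers typically anonymize their usage. The author recommends setting a billing limit to reduce potential financial losses and notes how difficult it is to seek compensation. The post closes by inviting readers to share their thoughts, providing contact details, and promoting other articles and content.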

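A Hacker News thread discusses the trend of websites increasingly requiring login, driven by issues such as rate limiting and AI scraping. The original post worries that the "unauthenticated web" is at risk. Commenters debated whether copyright notices are effective at preventing website content from being used for AI training; some argued that LLM companies either ignore copyright or defend themselves by claiming "fair use". Some suggested that stricter licenses are needed. Several users pointed to the inevitability of data being scraped and questioned the reliance on services such as Cloudflare. One commenter argued that concern about LLMs is overblown and pointed to their usefulness. Another blamed Silicon Valley for profiting from the public web. A counterargument held that the tech industry has historically supported unrestricted scraping, and that the backlash prompted by LLMs is not enough to justify an immediate change in the law.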

Original text
I fear for the unauthenticated web

LLM and AI companies seem to all be in a race to breathe the last breath of air in every room they stumble into. This practice started with larger websites, ones that already had protection from malicious usage like denial-of-service and abuse in the form of services like Cloudflare or Fastly.

But the list of targets has been getting longer. At this point we're seeing LLM and AI scrapers targeting small project forges like the GNOME GitLab server.

How long until scrapers start hammering Mastodon servers? Individual websites? Are we going to have to require authentication or JavaScript challenges on every web page from here on out?

All this for what, shitty chat bots? What an awful thing that these companies are doing to the web.

I suggest everyone who uses cloud infrastructure for hosting set up a billing limit to avoid an unexpected bill in case they're caught in the crosshairs of a negligent company. All the abusers anonymize their usage at this point, so good luck trying to get compensated for damages.
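
For readers whose provider does not offer a hard billing limit, the same idea can be approximated in the application itself. The following is only a minimal sketch, not something from the original post: the `EgressBudget` class, the cap, and the per-gigabyte price are all hypothetical, and real cloud billing involves more than egress traffic.

```python
from dataclasses import dataclass


@dataclass
class EgressBudget:
    """Tracks estimated egress spend against a hard monthly cap.

    Hypothetical helper: the cap and price below are assumptions,
    not figures from any real provider.
    """

    monthly_cap_usd: float  # the most you are willing to pay per month
    usd_per_gb: float       # assumed egress price; check your provider's rates
    bytes_sent: int = 0     # running total for the current month

    def spent_usd(self) -> float:
        """Estimated spend so far, using decimal gigabytes (1e9 bytes)."""
        return self.bytes_sent / 1e9 * self.usd_per_gb

    def allow(self, response_bytes: int) -> bool:
        """Record the response and return True only if it fits under the cap."""
        projected = (self.bytes_sent + response_bytes) / 1e9 * self.usd_per_gb
        if projected > self.monthly_cap_usd:
            return False  # caller should refuse or degrade the response
        self.bytes_sent += response_bytes
        return True


# Usage: a 1 USD cap at an assumed 0.10 USD per GB.
budget = EgressBudget(monthly_cap_usd=1.0, usd_per_gb=0.10)
first = budget.allow(4_000_000_000)   # fits under the cap
second = budget.allow(4_000_000_000)  # still fits
third = budget.allow(4_000_000_000)   # would exceed the cap, so it is refused
```

A guard like this only limits what the application itself serves; a provider-side billing limit, where available, remains the safer backstop because it also covers costs the application never sees.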

Have thoughts or questions? Send them my way:

sethmlarson.99 (Signal)


Thanks for reading! ♡ This work is licensed under CC BY-SA 4.0
