(Comments)

Original link: https://news.ycombinator.com/item?id=43424340

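A Hacker News thread discusses the growing tendency of websites to rate-limit visitors and require logins in response to AI scraping. The linked post worries that the "unauthenticated web" is at risk. Commenters debated whether a copyright notice can keep site content out of AI training; some argued that LLM companies either ignore copyright or defend their use as "fair use". One commenter hoped site owners would not be forced to adopt stricter licenses. Several users pointed out that scraping is inevitable and questioned the wisdom of depending on services such as Cloudflare. One commenter found the worry about LLMs overblown and pointed to their usefulness. Another blamed Silicon Valley for exploiting the public web for profit. A rebuttal noted that the tech industry has historically supported unrestricted scraping, that LLMs have changed sentiment rather than the legal landscape, and that any response would have to come from changing the laws rather than from the courts.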


  • Original text
    I Fear for the Unauthenticated Web (sethmlarson.dev)
    22 points by SethMLarson 1 hour ago | 12 comments










    Rate limiting is the first step before cutting everything off behind forced logins.

    > This practice started with larger websites, ones that already had protection from malicious usage like denial-of-service and abuse in the form of services like Cloudflare or Fastly

    FYI Cloudflare has a very usable free tier that’s easy to set up. It’s not limited to large websites.



    I get the feeling that I'm going to read a blog post in a few years telling us that the CDN companies have been selling everything pulled through their cache to the AI companies since 2022


    And even if they don't, is everything depending on Cloudflare to stay online a good thing?


    I would think all you need to do is add a copyright statement of some kind.

    Sad things are getting to this point. Maybe I should add this to my site :)

    (c) Copyright (my email), if used for any form of LLM processing, you must contact me and pay 1000USD per word from my site for each use.



    It's reasonably likely, but not yet settled, that LLM training falls under fair use and doesn't require a license. This is what the https://githubcopilotlitigation.com/ class action (from 2022) is about, and it's still making its way through the courts. This prediction market has it at 12% likely to succeed, suggesting that courts will not agree with you: https://manifold.markets/JeffKaufman/will-the-github-copilot...


    Copyright is for topics like redistribution of the source material. You can’t add arbitrary terms to a copyright claim that go beyond what copyright law supports.

    I think you’re confusing copyright with a EULA. You would need users to agree to the EULA terms before viewing the material. You can’t hide contractual obligations in the footer of your website and call it copyright.



    Such a notice is legally meaningless, though. Doubly so if the courts rule that scraping for AI purposes counts as fair use.


    The reality is that a lot of these small websites have very permissive licenses. I really hope we don't get to the point where we must all make our licenses stricter.


    The reality is that none of these LLM scrapers give a damn about copyright, because the entire AI industry is built on flagrant copyright violation, and the premise that they can be stopped by a magic string is laughable.

    You could sue, if you can afford it, meanwhile all of your data is already training their models.



    For some reason I am not really moved by a lot of the hand wringing I am seeing lately.

    It's not a binary thing to me: LLMs are not god, but even without AGI, they have proven wildly useful to me. Calling them "shitty chat bots" doesn't sway me.

    Further, I have always assumed that everything I post to the web is publicly accessible to everyone and everything. We lost any battle we thought we could wage some 2+ decades ago, when web crawlers started hoovering up data from our sites.



    Yet another entry in the long and shameful history of Silicon Valley abusing the public square for their own profit (or in this case, fantasies of profit) and the rest of us just have to learn to live with it because the justice system simply will not even try and give us recourse.

    Move fast and break things apparently has a bonus clause for the things you break not being your responsibility to fix.



    I don't think the justice system is the one to blame here. Right up until LLMs and their huge datamining operations appeared, everyone in tech was strongly for unrestricted scraping. Everybody here cheered the LinkedIn decision [0], saying "it's on the public web: if you didn't want it to be scraped, you should've put it behind authentication". LLMs change nothing about the legal landscape, they've just convinced everyone on an emotional level that unrestricted scraping is no longer an automatic good. It's not the justice system's job to react to such vibe shifts, the laws themselves have to be changed.

    [0]: https://news.ycombinator.com/item?id=15012883






