(Comments)

Original link: https://news.ycombinator.com/item?id=43397361

Drewdevault.com's article "Please stop externalizing your costs directly into my face" sparked a discussion on Hacker News about data scraping by LLM companies. Drew voiced his frustration with LLM scrapers hammering his and other people's sites, consuming server resources, and called on people to stop legitimizing LLMs and AI image generators. Commenters discussed potential solutions, such as Cloudflare or DDoS protection, while acknowledging the broader implications for the open web. Concerns were raised about LLM companies using residential proxies to disguise scraping traffic, and about data being used without consent. Some users likened LLM scraping to a DDoS attack, highlighting the cost burden it places on site owners. The discussion also touched on the internet drifting toward requiring elaborate infrastructure to fight such problems, which could hurt small websites and open source projects.

Related Articles
  • Please stop externalizing your costs directly into my face. 2025-03-18
  • (Comments) 2024-07-31
  • (Comments) 2024-04-29
  • (Comments) 2024-06-16
  • (Comments) 2025-02-23

  • Original Article
    Please stop externalizing your costs directly into my face (drewdevault.com)
    85 points by Tomte 46 minutes ago | 25 comments

    > If you personally work on developing LLMs et al, know this: I will never work with you again, and I will remember which side you picked when the bubble bursts.

    After using Claude Code for an afternoon, I have to say I don't think this bubble is going to burst any time soon.



    > If you personally work on developing LLMs et al, know this: I will never work with you again, and I will remember which side you picked when the bubble bursts.

    Does this statement have ramifications for sourcehut and the type of projects allowed there? Or is it merely personal opinion?



    Good rant!

    The clever LLM scrapers sound a lot like residential traffic. How does the author know it's not? Behavior? Volume? Unlikely coincidence?

    > random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.
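
    One way that pattern shows up in practice is in the per-IP request distribution: instead of a handful of hot crawler IPs, you get an enormous long tail of addresses that each appear exactly once. A minimal sketch of that check (the log path and field layout assume a combined-format access log; both are illustrative):

        # Sketch: how much traffic comes from IPs seen only once?
        # A normal audience has repeat visitors; the scrapers described
        # above show up as a huge long tail of single-request IPs.
        from collections import Counter

        hits = Counter()
        with open("access.log") as log:            # hypothetical log path
            for line in log:
                hits[line.split(" ", 1)[0]] += 1   # IP is the first field

        total = sum(hits.values())
        singletons = sum(1 for n in hits.values() if n == 1)
        print(f"{len(hits)} unique IPs, {total} requests")
        print(f"{singletons / len(hits):.0%} of IPs made exactly one request")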



    There are commercial services that provide residential proxies, i.e. you get to tunnel your scraper or bot traffic through actual residential connections. (see: Bright Data, oxylabs, etc.)

    They accomplish this by providing home users with some app that promises to pay them money for use of their connection. (see: HoneyGain, peer2profit, etc.)

    Interestingly, the companies selling the tunnel service to companies and the ones paying home users to run an app are sometimes different, or at least they use different brands to cater to the two sides of the market. It also wouldn't surprise me if they sold capacity to each other.

    I suspect some of these LLM companies (or the ones they outsource data capture to) route some of their traffic through these residential proxy services. It's funny because some of these companies already have a foothold inside homes (Google Nest and Amazon Alexa devices, etc.), but for a number of reasons (e.g. legal) they would probably rather go through a third party.
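
    Mechanically, these pools are sold as ordinary proxy endpoints, so routing scraper traffic through one is a one-line change on the buyer's side. A sketch of what that might look like; the gateway hostname, port, and credentials are entirely made up, but the shape (one authenticated gateway that rotates the exit IP per request) is the common pattern:

        # Sketch: scraper traffic exiting through a rotating residential
        # proxy pool. All values below are hypothetical.
        import requests

        proxy = "http://customer-12345:secret@gw.resi-proxy.example:8000"
        resp = requests.get(
            "https://example.com/some/page",
            proxies={"http": proxy, "https": proxy},
            headers={"User-Agent": "Mozilla/5.0 ..."},  # blend in with browsers
            timeout=30,
        )
        print(resp.status_code, len(resp.text))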



    They could be local LLMs doing search, some SETI@home-style distributed work, or something else.

    I host my blog at Mataroa, and the stats show 2K hits for some pages on some days. I don't think my small blog and writings are that popular. Asking ChatGPT about these subjects with my name results in some hilarious hallucinations, pointing to poor training due to lack of data.

    IOW, my blog is also being scraped and used to train LLMs. Not nice. I really share Drew's sentiment. I never asked for or consented to this.



    It sucks to say, but maybe the best solution is to pay Cloudflare or another fancy web security/CDN company to get rid of this problem for you?

    Which is extremely dumb, but when the alternatives are “10x capacity just for stupid bots” or “hire a guy whose job it is to block LLMs”… maybe that’s cheapest? Yes, it sucks for the open web, but if it’s your livelihood I probably would consider it.

    (Either that or make following robots.txt a legal requirement, but that feels also like stifling hobbyists that just want to scrape a page)
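
    For reference, the opt-out in question is just a robots.txt file; OpenAI, Anthropic, Common Crawl, and Google do publish user-agent tokens for their documented crawlers, though the stealth scrapers described elsewhere in this thread ignore it entirely:

        # robots.txt — refuse the documented LLM training crawlers
        User-agent: GPTBot
        Disallow: /

        User-agent: ClaudeBot
        Disallow: /

        User-agent: CCBot
        Disallow: /

        User-agent: Google-Extended
        Disallow: /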



    He's just right


    I really wonder how to best deal with this issue, almost seems like all web traffic needs to be behind CDNs which is horrible for the open web.


    The internet is evolving to a state where, if you don't have an n-level-deep stack, you're cooked.

    Which is ugly on so many levels.



    I had to take my personal Gitea instance offline, as it was regularly spammed, crashing Caddy in the process.


    Just like with crypto, AI culture is fundamentally tainted (see this very thread, where people are defending this shit). Legislation hasn't caught up, and doing what's essentially a crime (functionally this is just a DDoS) is being sold as a cool thing to do by the people who don't foot the bill. If you ever ask yourself why everybody else hates you and your ilk, this attitude is why. The thing I don't understand with AI, though, is that without the host, your parasite is nothing. Shouldn't it be in your best interest not to kill it?


    What's surprising to me is that established companies (OpenAI, Anthropic, etc) would resort to such aggressive scanning to extract a profit. Even faking the user agent and using residential IPs is really scummy.


    They're doing "something awesome", so why wouldn't they (in their own words), I ask. What they do honestly boils my blood.

    Assuming that established companies are automatically ethical and just is not correct. Meta used "a laptop and a client tuned to seed as minimally as possible" to torrent 81TB of copyrighted material to train their models.

    For every picture-perfect and shiny facade, there's an engineer sitting on the floor in a corner, hacking things together to prop up that facade.



    The internet you knew and mastered no longer exists.


    All things unregulated eventually turn into a Mogadishu-style ecosystem where parasitism is ammunition. Open source projects are the plants of this brave new disgusting world, to be grazed upon by the animals that suppressed the plants' protection system, aka the state.


    Was with Drew until the solution:

    >Please stop legitimizing LLMs or AI image generators or GitHub Copilot or any of this garbage. I am begging you to stop using them, stop talking about them, stop making new ones, just stop.

    Anyone with any inkling of the current industry understands this is Old-Man-Yells-At-Cloud territory. Why even offer this as a solution, I'm unsure.

    Here are potential solutions: 1) Cloudflare. 2) Your own DDoS-protecting reverse proxy using captchas etc. 3) Poisoning scraping results (with syntax errors in code, lol) so they ditch you as a good source of data.
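
    Option 3 can be as small as a response filter that corrupts code samples served to suspected scrapers. A toy sketch, assuming a Flask app; the bot heuristic is a placeholder, since telling these scrapers apart from real users is the hard part, as quoted upthread:

        # Toy sketch of option 3: serve subtly broken code to suspected
        # bots so the scraped copy is worthless as training data.
        import random
        import re

        from flask import Flask, request

        app = Flask(__name__)

        def is_suspected_bot(req) -> bool:
            # Placeholder heuristic; real detection is the hard part.
            return "python-requests" in req.headers.get("User-Agent", "")

        def poison(html: str) -> str:
            # Randomly strip colons inside <code> blocks, e.g. turning
            # "def f(x):" into the syntax error "def f(x)".
            def mangle(m):
                code = m.group(1)
                if random.random() < 0.5:
                    code = code.replace(":", "")
                return "<code>" + code + "</code>"
            return re.sub(r"<code>(.*?)</code>", mangle, html, flags=re.S)

        @app.route("/post/<slug>")
        def post(slug):
            html = f"<h1>{slug}</h1><code>def f(x): return x + 1</code>"
            return poison(html) if is_suspected_bot(request) else html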



    Now ask me how I know that your preferred solution to not being stabbed in the chest while walking on the street is to just wear a stab vest wherever you go.


    With your offered solutions it looks like you missed the whole point of the article.


    I had a friend that said he'd never get a mobile phone, and he did hold out until maybe 2010. He eventually realized the world had changed.


    > I had a friend that said he'd never get a mobile phone, and he did hold out...

    Up until this point you had a good start to a very inspiring story. ;)



    devhc? :D


    Welcome to 2025...


    I work for a hosting company, and I know what this is like. And while I completely respect that you don't want to give out your resources for free, a properly programmed and/or cached website wouldn't be brought down by crawlers, no matter how aggressive. Crawlers are hitting our clients' sites all the same, but you only hear problems from those who have piece-of-shit websites that take seconds to generate.
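
    For what it's worth, the caching this comment has in mind can be as little as a micro-cache in front of the app server, so a burst of identical requests hits the cache instead of the backend. A sketch in nginx terms, with the paths and upstream hypothetical:

        # Hypothetical nginx micro-cache: even a 10-second TTL absorbs
        # crawler storms, since repeated requests for the same page are
        # served from cache rather than regenerated.
        proxy_cache_path /var/cache/nginx keys_zone=microcache:10m max_size=1g;

        server {
            listen 80;
            location / {
                proxy_cache microcache;
                proxy_cache_valid 200 10s;
                proxy_cache_use_stale updating error timeout;
                proxy_pass http://127.0.0.1:8080;
            }
        }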


    git blame is always expensive to compute, and precomputing (or caching) it for every revision of every file is going to consume a lot of storage.


    Precomputing git blame should take the same order of magnitude of storage as the repository itself. Same number of lines for each revision, same number of lines changed in every commit.
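
    That claim is easy to test back-of-the-envelope: precompute blame for every file at one revision and see how well it compresses. A sketch, assuming it runs at the root of a git checkout:

        # Sketch: precompute `git blame --porcelain` for every tracked
        # file at HEAD and measure raw vs. compressed size. Blame output
        # repeats one commit id per line, so it compresses roughly like
        # the file contents themselves.
        import subprocess
        import zlib

        files = subprocess.run(
            ["git", "ls-files"], capture_output=True, text=True, check=True
        ).stdout.splitlines()

        raw = packed = 0
        for path in files:
            blame = subprocess.run(
                ["git", "blame", "--porcelain", "HEAD", "--", path],
                capture_output=True,
            ).stdout
            raw += len(blame)
            packed += len(zlib.compress(blame))

        print(f"{len(files)} files, {raw} bytes of blame, {packed} compressed")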