(Comments)

Original link: https://news.ycombinator.com/item?id=43422413

A Hacker News thread discusses the growing problem of AI companies abusing open-source infrastructure by scraping it for training data. Read the Docs shared their experience, noting that some companies (Facebook, for example) over-crawled and never responded, while others worked with them to fix the problem. Many users voiced concern about the disregard for goodwill and the prospect of AI companies exploiting data and labor without constraint. Several possible countermeasures were raised, including bot-detection tools (such as Fastly's), IP blocking, serving proof-of-work challenges, and "poisoning" AI datasets. Some argued the problem will force sites behind login walls, which would hurt search-engine indexing and could lead to a more closed web. The prevailing view is that stronger defenses against aggressive AI crawlers are needed to protect open-source resources, and perhaps user privacy as well.


Original text
FOSS infrastructure is under attack by AI companies (thelibre.news)
117 points by todsacerdoti 38 minutes ago | 44 comments










Yep -- our story here: https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse... (quoted in the OP) -- everyone I know who runs large internet infrastructure has a similar story -- this post does a great job of rounding a bunch of them up in one place.

I called it when I wrote it, they are just burning their goodwill to the ground.

I will note that one of the main startups in the space worked with us directly, refunded our costs, and fixed the bug in their crawler. Facebook never replied to our emails, the link in their User Agent led to a 404 -- an engineer at the company saw our post and reached out, giving me the right email -- which I then emailed 3x and never got a reply.



> just burning their goodwill to the ground

AI firms seem to be leading from a position that goodwill is irrelevant: a $100bn pile of capital, like an 800lb gorilla, does what it wants. AI will be incorporated into all products whether you like it or not; it will absorb all data whether you like it or not.



Yep. And it is much more far reaching than that. Look at the primary economic claim offered by AI companies: to end the need for a substantial portion of all jobs on the planet. The entire vision is to remake the entire world into one where the owners of these companies own everything and are completely unconstrained. All intellectual property belongs to them. All labor belongs to them. Why would they need good will when they own everything?

"Why should we care about open source maintainers" is just a microcosm of the much larger "why should we care about literally anybody" mindset.



We, the people, might need to come up with a few proverbial tranquilizer guns here soon


AI tarpits && lim (human-curated content / mediocre AI answers) -> 0 = AIs crumbling into dust by themselves.


Just a callout that Fastly provides free bot detection, CDN, and other security services for FOSS projects, and has been doing so for 10+ years: https://www.fastly.com/fast-forward (disclaimer: I work for Fastly and help with this program)

Without going into too much detail, this tracks with the trends in inquiries we're getting from new programs and existing members. A few years ago, the requests were almost exclusively related to performance, uptime, implementing OWASP rules in a WAF, or more generic volumetric impact. Now, AI scraping is increasingly something that FOSS orgs come to us for help with.



Isn't this just poor, sloppy crawler implementation? You shouldn't need to fetch a repo more than once to add it to a training set.


The big takeaway here is that Google's (and advertisement in general) dominance over the web is going away.

This is because the only way to stop the bots is with a captcha, and this also stops search indexers from indexing your site. This will result in search engines not indexing sites, and hence providing no value anymore.

There's probably going to be a small lag as the knowledge in current LLMs dries up, because no one can scrape the web in an automated fashion anymore.

It'll all burn down.



At this rate, it's more than FOSS infrastructure -- although that's a canary in the coalmine I especially sympathize with -- it's anonymous Internet access altogether.

Because you can put your site behind an auth wall, but these new bots can solve the captchas and imitate real users like never before. Particularly if they're hitting you from residential IPs and with fake user agents like the ones in the article -- or even real user agents because they're wired up to something like Playwright.

What's left except for sites to start requiring credit cards, Worldcoin, or some equally depressing equivalent?



Could one put a mangler on the responses to suspected bots to poison their data sets with nonsense code.. :/
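For illustration, a minimal sketch of that mangling idea in Python -- the `is_suspected_bot` flag is assumed to come from whatever detection you already trust, which is the genuinely hard part:

```python
import random
import re

def mangle(body: str, is_suspected_bot: bool) -> str:
    """Serve humans the real page; serve suspected bots a subtly
    corrupted copy to poison any training set built from it."""
    if not is_suspected_bot:
        return body
    # Rotate identifier-like tokens so code still looks plausible
    # but no longer means what it originally said.
    tokens = sorted(set(re.findall(r"\b[a-z_][a-z0-9_]{3,}\b", body)))
    random.shuffle(tokens)
    mapping = dict(zip(tokens, tokens[1:] + tokens[:1]))
    return re.sub(r"\b[a-z_][a-z0-9_]{3,}\b",
                  lambda m: mapping.get(m.group(0), m.group(0)),
                  body)
```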


I thought the same. Maybe start prefixing each commit message with "LLM bots are all lying bastards" or something similar :)


I wonder if that would just make them try harder. Scrape multiple times and diff, to gain confidence that the data hasn't been poisoned.

We've never had one of these arms races end up with the defenders winning.



As long as you can detect that the bot is an AI crawler.


This. Don't just mangle the content. Flood the bot with tailored misinformation and things that are illegal in its jurisdiction but not yours.

They will never respect you, but the second they notice this hurts their business more than it gains them, they will stop.



Combine this with the Anubis tech...


It's really surreal to see my project in the preview image like this. That's wild! If you want to try it: https://github.com/TecharoHQ/anubis. So far I've noticed that it seems to actually work. I just deployed it to xeiaso.net as a way to see how it fails in prod for my blog.


It's going to get to the point where everything will be put behind a login to prevent LLM scrapers from scanning a site. Annoying, but it's the only option I can think of. If they use an account for scraping, you just ban the account.


and then they'll log in there too...


To be clear, this is not an attack in the deliberate sense, and has nothing to do with AI except in that AI companies want to crawl the internet. This is more "FOSS sites damaged by extreme incompetence and unaccountability." The crawlers could just as well be search engine startups.


Across my sites -- mostly open data sites -- the top 10 referrers are all bots. That doesn't include the long tail of randomized user agents that we get from the Alibaba netblocks.

At this point, I think we're well under 1% actual users on a good day.



Perhaps time to start a central community ban pool for IP ranges?
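Mechanically, a shared pool would just be a published list of CIDR ranges that each site loads at its edge. A sketch of the consumer side, assuming a hypothetical list URL with one range per line:

```python
import ipaddress
import urllib.request

# Hypothetical community-maintained list of abusive CIDR ranges,
# one per line; the URL is illustrative, not a real service.
BAN_POOL_URL = "https://example.org/community-ban-pool.txt"

def load_ban_pool(url: str = BAN_POOL_URL):
    with urllib.request.urlopen(url) as resp:
        lines = resp.read().decode().splitlines()
    return [ipaddress.ip_network(line.strip())
            for line in lines
            if line.strip() and not line.startswith("#")]

def is_banned(client_ip: str, pool) -> bool:
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in pool)
```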


Doesn't really work if crawlers are coming from the IP ranges of AWS and Azure etc...


Or sometimes they use consumer IP proxies. Makes it even harder because sometimes those IPs get reused for actual users.


My question is: can we serve PoW challenges to these AI/LLM scrapers in a way that's actually profitable?


That's what I've been doing! It works shockingly well. https://github.com/TecharoHQ/anubis
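For context, the core of a hashcash-style PoW scheme is small; a rough Python sketch of the general technique (not Anubis's actual implementation -- in practice the solving loop runs as JavaScript in the client):

```python
import hashlib
import os

DIFFICULTY = 20  # client must find a hash with 20 leading zero bits

def new_challenge() -> str:
    return os.urandom(16).hex()

def verify(challenge: str, nonce: str) -> bool:
    """Admit the request only if sha256(challenge + nonce) has
    DIFFICULTY leading zero bits -- cheap to check, costly to find."""
    digest = hashlib.sha256((challenge + nonce).encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0

def solve(challenge: str) -> str:
    """What the client-side JavaScript does: brute-force a nonce
    (~2**DIFFICULTY hashes on average)."""
    n = 0
    while not verify(challenge, str(n)):
        n += 1
    return str(n)
```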


I'm curious if the PoW component is really necessary, AIUI these types of blanket untargeted scrapers are usually curl wrappers which don't run Javascript, so any link which is injected at runtime would be invisible to them. Unless AI companies are so flush with cash that they can afford to just use headless Chrome for everything, efficiency be damned.
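A sketch of that runtime-injection point: if the real link only exists after script execution, a fetch-and-parse scraper never sees it. Flask and the /docs path here are illustrative choices, not anything from the article:

```python
from flask import Flask

app = Flask(__name__)

# The real content link only exists after script execution, so a
# curl-style scraper that never runs JavaScript can't discover /docs.
PAGE = """<!doctype html>
<div id="nav"></div>
<script>
  const a = document.createElement("a");
  a.href = "/" + ["d", "o", "c", "s"].join("");  // assembled at runtime
  a.textContent = "Documentation";
  document.getElementById("nav").appendChild(a);
</script>"""

@app.route("/")
def index():
    return PAGE
```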


Unless I am missing something, the result of that generated work has no monetary value though.


Finally, a reason for bitcoins!


This is probably a dumb question, but have they tried sending abuse reports to hosting providers? Or even lawsuits? Most hosting providers take it seriously when their client is sending a DoS attack, because if they don't, they can get kicked off the internet by their provider.


I wonder if the future is for honest crawlers to do something like DKIM to provide a cheap cryptographically verifiable identity, where reputation can be staked on good behavior, and to treat the rest of the traffic like it's a full fledged chrome instance that had better be capable of solving hashcash challenges when traffic gets too hot.

It's a shitty solution, but the status quo is quite untenable and will eventually leave Cloudflare as a spooky MITM for all the web's traffic.
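A sketch of what the verification half of such a scheme could look like, using Ed25519 via the `cryptography` package; the signed-string format and the idea of publishing the public key in DNS (as DKIM does) are assumptions:

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey, Ed25519PublicKey,
)

# Crawler side: sign method, path, and date with a long-lived identity key.
def sign_request(key: Ed25519PrivateKey, method: str, path: str,
                 date: str) -> bytes:
    return key.sign(f"{method}\n{path}\n{date}".encode())

# Server side: fetch the crawler's published public key (e.g. from a
# DNS TXT record, as DKIM does) and verify before granting crawl rate.
def verify_request(pub: Ed25519PublicKey, signature: bytes, method: str,
                   path: str, date: str) -> bool:
    try:
        pub.verify(signature, f"{method}\n{path}\n{date}".encode())
        return True
    except InvalidSignature:
        return False
```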



This article starts by citing a blog article - displays a screenshot of the article - but doesn't link to it.


It was this: [1], posted here yesterday I think.

[1]: https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...



We're close to finding a clear use-case for Bitcoin with this one.


Interesting. Basically moving the proof-of-work off the user's phone and to a dedicated mine. Websites could just have a lightning wallet or something and auto-charge the user 1e-7 bitcoin to access the page.


Does GitHub have this problem?


In the past month, I've had to block LLM bots attacking my poor little VPSs — not once, but twice.

First, it was Facebook https://news.ycombinator.com/item?id=23490367 and now it's these other companies.

What's worse? They completely ignore a simple HTTP 429 status.
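When 429 is ignored, the usual fallback is to escalate in code; a minimal token-bucket sketch that serves 429s first and stops answering an IP that keeps hammering anyway (all thresholds arbitrary):

```python
import time
from collections import defaultdict

RATE = 5.0               # allowed requests per second, per IP
BURST = 20.0             # token-bucket size
STRIKES_TO_BLOCK = 100   # ignored 429s before we stop answering

buckets = defaultdict(lambda: [BURST, time.monotonic()])
strikes = defaultdict(int)
blocked = set()

def check(ip: str) -> int:
    """Return the HTTP status to serve: 200, 429, or 403 once blocked."""
    if ip in blocked:
        return 403
    tokens, last = buckets[ip]
    now = time.monotonic()
    tokens = min(BURST, tokens + (now - last) * RATE)
    if tokens < 1.0:
        buckets[ip] = [tokens, now]
        strikes[ip] += 1
        if strikes[ip] >= STRIKES_TO_BLOCK:
            blocked.add(ip)   # it ignored 100 polite 429s
        return 429
    buckets[ip] = [tokens - 1.0, now]
    return 200
```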



These are DDOS attacks and should be treated in law as such. (Although I do realise that in many countries now we no longer have any effective "rule of law")


In this world, the richest win, not the nicest (cf. Sam Altman).


You going to prosecute China?


At some point it's easier to geoblock a whole country at the firewall level and loginwall the rest of the world, rather than trying to explain that in your jurisdiction, which is not their jurisdiction, what they are doing is a crime — which they don't give a single fuck about.


...at some point, some people started appreciating mailing lists and the distributed nature of Git again.


Yes and no.

The distributed nature of git is fine until you want to serve it to the world -- then you're back to bad actors. They're looking for commits because they're nicely chunked, I'm guessing.



And Usenet, and IRC with a registered user prereq to join.

Also, set up AI tarpits as fake links with recursive calls. Make them mad with non-curated bullshit made from Markov chain generators until their cache begins to rot forever.
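A minimal sketch of such a tarpit: every URL deterministically yields fresh Markov sludge plus a few recursive links deeper into the pit, so a crawler that follows links never runs out of "content" (the corpus here is a placeholder):

```python
import hashlib
import random

CORPUS = ("the crawler followed the link and found another page of "
          "perfectly plausible nonsense generated just for it").split()

# Order-1 Markov chain over a tiny placeholder corpus.
CHAIN = {}
for a, b in zip(CORPUS, CORPUS[1:]):
    CHAIN.setdefault(a, []).append(b)

def page(path: str) -> str:
    # Seed from the path so each URL yields stable but unique sludge.
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest(), "big")
    rng = random.Random(seed)
    word = rng.choice(CORPUS)
    words = [word]
    for _ in range(200):
        word = rng.choice(CHAIN.get(word, CORPUS))
        words.append(word)
    links = "".join(f'<a href="{path}/{rng.randrange(10**9)}">more</a> '
                    for _ in range(3))   # recursive links, ever deeper
    return f"<html><body><p>{' '.join(words)}</p>{links}</body></html>"
```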



This problem will likely only get worse, so I'd be interested to see how people adapt. I was thinking about sending data through the mail like the old days. Maybe we go back to the original Ted Nelson Xanadu setup, charging users small amounts for access but working out ISP or VPN deals to give subscribers enough credit to browse without issues.





