(Comments)

Original link: https://news.ycombinator.com/item?id=43422413

A Hacker News thread discusses the growing problem of AI companies abusing open-source infrastructure by scraping it for training data. Read the Docs shared their experience, noting that some companies (Facebook, for example) over-crawled and never responded, while others worked with them to fix the problem. Many users voiced concern about the disregard for goodwill and the prospect of AI companies exploiting data and labor without constraint. Several possible countermeasures were raised, including bot-detection tools (such as Fastly's), IP blocking, serving proof-of-work challenges, and "poisoning" AI datasets. Some argued the problem will force sites behind login walls, which would hurt search-engine indexing and could lead to a more closed web. The prevailing view is that stronger defenses against aggressive AI crawlers are needed to protect open-source resources, and perhaps user privacy as well.


Original text
FOSS infrastructure is under attack by AI companies (thelibre.news)
117 points by todsacerdoti 38 minutes ago | 44 comments










Yep -- our story here: https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse... (quoted in the OP) -- everyone I know who runs large internet infrastructure has a similar story -- this post does a great job of rounding a bunch of them up in one place.

I called it when I wrote it, they are just burning their goodwill to the ground.

I will note that one of the main startups in the space worked with us directly, refunded our costs, and fixed the bug in their crawler. Facebook never replied to our emails, the link in their User Agent led to a 404 -- an engineer at the company saw our post and reached out, giving me the right email -- which I then emailed 3x and never got a reply.



> just burning their goodwill to the ground

AI firms seem to be leading from a position that goodwill is irrelevant: a $100bn pile of capital, like an 800lb gorilla, does what it wants. AI will be incorporated into all products whether you like it or not; it will absorb all data whether you like it or not.



Yep. And it is much more far reaching than that. Look at the primary economic claim offered by AI companies: to end the need for a substantial portion of all jobs on the planet. The entire vision is to remake the entire world into one where the owners of these companies own everything and are completely unconstrained. All intellectual property belongs to them. All labor belongs to them. Why would they need good will when they own everything?

"Why should we care about open source maintainers" is just a microcosm of the much larger "why should we care about literally anybody" mindset.



We, the people, might need to come up with a few proverbial tranquilizer guns here soon


AI tarpits && lim (human-curated content / mediocre AI answers) -> 0 = AIs crumbling into dust by themselves.


Just a callout that Fastly provides free bot detection, CDN, and other security services for FOSS projects, and has been doing so for 10+ years: https://www.fastly.com/fast-forward (disclaimer: I work for Fastly and help with this program)

Without going into too much detail, this tracks with the trends in inquiries we're getting from new programs and existing members. A few years ago, the requests were almost exclusively related to performance, uptime, implementing OWASP rules in a WAF, or more generic volumetric impact. Now, AI scraping is increasingly something that FOSS orgs come to us for help with.



Isn't this just poor, sloppy crawler implementation? You shouldn't need to fetch a repo more than once to add it to a training set.


The big takeaway here is that Google's (and advertisement in general) dominance over the web is going away.

This is because the only way to stop the bots is with a captcha, and this also stops search indexers from indexing your site. This will result in search engines not indexing sites, and hence providing no value anymore.

There's probably going to be a small lag as the knowledge in current LLMs dries up, because no one can scrape the web in an automated fashion anymore.

It'll all burn down.



At this rate, it's more than FOSS infrastructure -- although that's a canary in the coalmine I especially sympathize with -- it's anonymous Internet access altogether.

Because you can put your site behind an auth wall, but these new bots can solve the captchas and imitate real users like never before. Particularly if they're hitting you from residential IPs and with fake user agents like the ones in the article -- or even real user agents because they're wired up to something like Playwright.

What's left except for sites to start requiring credit cards, Worldcoin, or some equally depressing equivalent?



Could one put a mangler on the responses to suspected bots to poison their data sets with nonsense code.. :/
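For illustration, a minimal sketch of that mangling idea in Python -- the `is_suspected_bot` flag is assumed to come from whatever detection you already trust, which is the genuinely hard part:

```python
import random
import re

def mangle(body: str, is_suspected_bot: bool) -> str:
    """Serve humans the real page; serve suspected bots a subtly
    corrupted copy to poison any training set built from it."""
    if not is_suspected_bot:
        return body
    # Rotate identifier-like tokens so code still looks plausible
    # but no longer means what it originally said.
    tokens = sorted(set(re.findall(r"\b[a-z_][a-z0-9_]{3,}\b", body)))
    random.shuffle(tokens)
    mapping = dict(zip(tokens, tokens[1:] + tokens[:1]))
    return re.sub(r"\b[a-z_][a-z0-9_]{3,}\b",
                  lambda m: mapping.get(m.group(0), m.group(0)),
                  body)
```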


I thought the same. Maybe start prefixing each commit message with "LLM bots are all lying bastards" or something similar :)


I wonder if that would just make them try harder. Scrape multiple times and diff, to gain confidence that the data hasn't been poisoned.

We've never had one of these arms races end up with the defenders winning.



As long as you can detect that the bot is an AI crawler.


This. Don't just mangle the content. Flood the bot with tailored misinformation and things that are illegal in its jurisdiction but not yours.

They will never respect you, but the second they notice this hurts their business more than it gains them, they will stop.



Combine this with the Anubis tech...


It's really surreal to see my project in the preview image like this. That's wild! If you want to try it: https://github.com/TecharoHQ/anubis. So far I've noticed that it seems to actually work. I just deployed it to xeiaso.net as a way to see how it fails in prod for my blog.


It's going to get to the point where everything will be put behind a login to prevent LLM scrapers from scanning a site. Annoying, but it's the only option I can think of. If they use an account for scraping, you just ban the account.


and then they'll log in there too...


To be clear, this is not an attack in the deliberate sense, and has nothing to do with AI except in that AI companies want to crawl the internet. This is more "FOSS sites damaged by extreme incompetence and unaccountability." The crawlers could just as well be search engine startups.


Across my sites -- mostly open data sites -- the top 10 referrers are all bots. That doesn't include the long tail of randomized user agents that we get from the Alibaba netblocks.

At this point, I think we're well under 1% actual users on a good day.



Perhaps time to start a central community ban pool for IP ranges?
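Mechanically, a shared pool would just be a published list of CIDR ranges that each site loads at its edge. A sketch of the consumer side, assuming a hypothetical list URL with one range per line:

```python
import ipaddress
import urllib.request

# Hypothetical community-maintained list of abusive CIDR ranges,
# one per line; the URL is illustrative, not a real service.
BAN_POOL_URL = "https://example.org/community-ban-pool.txt"

def load_ban_pool(url: str = BAN_POOL_URL):
    with urllib.request.urlopen(url) as resp:
        lines = resp.read().decode().splitlines()
    return [ipaddress.ip_network(line.strip())
            for line in lines
            if line.strip() and not line.startswith("#")]

def is_banned(client_ip: str, pool) -> bool:
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in pool)
```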


Doesn't really work if crawlers are coming from the IP ranges of AWS and Azure etc...


Or sometimes they use consumer IP proxies. Makes it even harder because sometimes those IPs get reused for actual users.


My question is: can we serve PoW challenges to these AI/LLM scrapers in a way that's actually profitable?


That's what I've been doing! It works shockingly well. https://github.com/TecharoHQ/anubis
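For context, the core of a hashcash-style PoW scheme is small; a rough Python sketch of the general technique (not Anubis's actual implementation -- in practice the solving loop runs as JavaScript in the client):

```python
import hashlib
import os

DIFFICULTY = 20  # client must find a hash with 20 leading zero bits

def new_challenge() -> str:
    return os.urandom(16).hex()

def verify(challenge: str, nonce: str) -> bool:
    """Admit the request only if sha256(challenge + nonce) has
    DIFFICULTY leading zero bits -- cheap to check, costly to find."""
    digest = hashlib.sha256((challenge + nonce).encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0

def solve(challenge: str) -> str:
    """What the client-side JavaScript does: brute-force a nonce
    (~2**DIFFICULTY hashes on average)."""
    n = 0
    while not verify(challenge, str(n)):
        n += 1
    return str(n)
```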


I'm curious if the PoW component is really necessary, AIUI these types of blanket untargeted scrapers are usually curl wrappers which don't run Javascript, so any link which is injected at runtime would be invisible to them. Unless AI companies are so flush with cash that they can afford to just use headless Chrome for everything, efficiency be damned.
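A sketch of that runtime-injection point: if the real link only exists after script execution, a fetch-and-parse scraper never sees it. Flask and the /docs path here are illustrative choices, not anything from the article:

```python
from flask import Flask

app = Flask(__name__)

# The real content link only exists after script execution, so a
# curl-style scraper that never runs JavaScript can't discover /docs.
PAGE = """<!doctype html>
<div id="nav"></div>
<script>
  const a = document.createElement("a");
  a.href = "/" + ["d", "o", "c", "s"].join("");  // assembled at runtime
  a.textContent = "Documentation";
  document.getElementById("nav").appendChild(a);
</script>"""

@app.route("/")
def index():
    return PAGE
```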


Unless I am missing something, the result of that generated work has no monetary value though.


Finally, a reason for bitcoins!


This is probably a dumb question, but have they tried sending abuse reports to hosting providers? Or even lawsuits? Most hosting providers take it seriously when their client is sending a DoS attack, because if they don't, they can get kicked off the internet by their provider.


I wonder if the future is for honest crawlers to do something like DKIM to provide a cheap cryptographically verifiable identity, where reputation can be staked on good behavior, and to treat the rest of the traffic like it's a full fledged chrome instance that had better be capable of solving hashcash challenges when traffic gets too hot.

It's a shitty solution, but the status quo is quite untenable and will eventually leave Cloudflare as a spooky MITM for all the web's traffic.
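A sketch of what the verification half of such a scheme could look like, using Ed25519 via the `cryptography` package; the signed-string format and the idea of publishing the public key in DNS (as DKIM does) are assumptions:

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey, Ed25519PublicKey,
)

# Crawler side: sign method, path, and date with a long-lived identity key.
def sign_request(key: Ed25519PrivateKey, method: str, path: str,
                 date: str) -> bytes:
    return key.sign(f"{method}\n{path}\n{date}".encode())

# Server side: fetch the crawler's published public key (e.g. from a
# DNS TXT record, as DKIM does) and verify before granting crawl rate.
def verify_request(pub: Ed25519PublicKey, signature: bytes, method: str,
                   path: str, date: str) -> bool:
    try:
        pub.verify(signature, f"{method}\n{path}\n{date}".encode())
        return True
    except InvalidSignature:
        return False
```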



This article starts by citing a blog article - displays a screenshot of the article - but doesn't link to it.


It was this: [1], posted here yesterday I think.

[1]: https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...



We're close to finding a clear use-case for Bitcoin with this one.


Interesting. Basically moving the proof-of-work off the user's phone and to a dedicated mine. Websites could just have a lightning wallet or something and auto-charge the user 1e-7 bitcoin to access the page.


Does GitHub have this problem?


In the past month, I've had to block LLM bots attacking my poor little VPSs — not once, but twice.

First, it was Facebook https://news.ycombinator.com/item?id=23490367 and now it's these other companies.

What's worse? They completely ignore a simple HTTP 429 status.
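When 429 is ignored, the usual fallback is to escalate in code; a minimal token-bucket sketch that serves 429s first and stops answering an IP that keeps hammering anyway (all thresholds arbitrary):

```python
import time
from collections import defaultdict

RATE = 5.0               # allowed requests per second, per IP
BURST = 20.0             # token-bucket size
STRIKES_TO_BLOCK = 100   # ignored 429s before we stop answering

buckets = defaultdict(lambda: [BURST, time.monotonic()])
strikes = defaultdict(int)
blocked = set()

def check(ip: str) -> int:
    """Return the HTTP status to serve: 200, 429, or 403 once blocked."""
    if ip in blocked:
        return 403
    tokens, last = buckets[ip]
    now = time.monotonic()
    tokens = min(BURST, tokens + (now - last) * RATE)
    if tokens < 1.0:
        buckets[ip] = [tokens, now]
        strikes[ip] += 1
        if strikes[ip] >= STRIKES_TO_BLOCK:
            blocked.add(ip)   # it ignored 100 polite 429s
        return 429
    buckets[ip] = [tokens - 1.0, now]
    return 200
```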



These are DDOS attacks and should be treated in law as such. (Although I do realise that in many countries now we no longer have any effective "rule of law")


In this world, the richest win, not the nicest (cf. Sam Altman).


You going to prosecute China?


At some point it's easier to geoblock a whole country at the firewall level and loginwall the rest of the world, rather than trying to explain that in your jurisdiction, which is not their jurisdiction, what they are doing is a crime — which they don't give a single fuck about.


...at some point, some people started appreciating mailing lists and the distributed nature of Git again.


Yes and no.

The distributed nature of git is fine until you want to serve it to the world -- then you're back to bad actors. They're looking for commits because they're nicely chunked, I'm guessing.



And Usenet, and IRC with a registered user prereq to join.

Also, set up AI tarpits as fake links with recursive calls. Make them mad with non-curated bullshit made from Markov chain generators until their cache begins to rot forever.
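A minimal sketch of such a tarpit: every URL deterministically yields fresh Markov sludge plus a few recursive links deeper into the pit, so a crawler that follows links never runs out of "content" (the corpus here is a placeholder):

```python
import hashlib
import random

CORPUS = ("the crawler followed the link and found another page of "
          "perfectly plausible nonsense generated just for it").split()

# Order-1 Markov chain over a tiny placeholder corpus.
CHAIN = {}
for a, b in zip(CORPUS, CORPUS[1:]):
    CHAIN.setdefault(a, []).append(b)

def page(path: str) -> str:
    # Seed from the path so each URL yields stable but unique sludge.
    seed = int.from_bytes(hashlib.sha256(path.encode()).digest(), "big")
    rng = random.Random(seed)
    word = rng.choice(CORPUS)
    words = [word]
    for _ in range(200):
        word = rng.choice(CHAIN.get(word, CORPUS))
        words.append(word)
    links = "".join(f'<a href="{path}/{rng.randrange(10**9)}">more</a> '
                    for _ in range(3))   # recursive links, ever deeper
    return f"<html><body><p>{' '.join(words)}</p>{links}</body></html>"
```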



This problem will likely only get worse, so I'd be interested to see how people adapt. I was thinking about sending data through the mail like the old days. Maybe we go back to the original Ted Nelson Xanadu setup, charging users small amounts for access but working out ISP or VPN deals to give subscribers enough credit to browse without issues.





