(Comments)

Original link: https://news.ycombinator.com/item?id=43397361

Drewdevault.com's article "Please stop externalizing your costs directly into my face" sparked a discussion on Hacker News about data scraping by LLM companies. Drew voiced his frustration with LLM scrapers hammering his and other people's sites, consuming server resources, and called on people to stop legitimizing LLMs and AI image generators. Commenters discussed potential solutions, such as Cloudflare or DDoS protection, while acknowledging the broader implications for the open web. Concerns were raised about LLM companies using residential proxies to disguise scraping traffic, and about data being used without consent. Some users likened LLM scraping to a DDoS attack, highlighting the cost burden it places on site owners. The discussion also touched on the internet drifting toward requiring elaborate infrastructure to fight such problems, which could hurt small websites and open source projects.

Related Articles
  • Please stop externalizing your costs directly into my face. 2025-03-18
  • (Comments) 2024-07-31
  • (Comments) 2024-04-29
  • (Comments) 2024-06-16
  • (Comments) 2025-02-23

  • Original Article
    Please stop externalizing your costs directly into my face (drewdevault.com)
    85 points by Tomte 46 minutes ago | 25 comments

    > If you personally work on developing LLMs et al, know this: I will never work with you again, and I will remember which side you picked when the bubble bursts.

    After using Claude Code for an afternoon, I have to say I don't think this bubble is going to burst any time soon.



    > If you personally work on developing LLMs et al, know this: I will never work with you again, and I will remember which side you picked when the bubble bursts.

    Does this statement have ramifications for sourcehut and the type of projects allowed there? Or is it merely personal opinion?



    Good rant!

    The clever LLM scrapers sound a lot like residential traffic. How does the author know it's not? Behavior? Volume? Unlikely coincidence?

    > random User-Agents that overlap with end-users and come from tens of thousands of IP addresses – mostly residential, in unrelated subnets, each one making no more than one HTTP request over any time period we tried to measure – actively and maliciously adapting and blending in with end-user traffic and avoiding attempts to characterize their behavior or block their traffic.
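
    One way that pattern shows up in practice is in the per-IP request distribution: instead of a handful of hot crawler IPs, you get an enormous long tail of addresses that each appear exactly once. A minimal sketch of that check (the log path and field layout assume a combined-format access log; both are illustrative):

        # Sketch: how much traffic comes from IPs seen only once?
        # A normal audience has repeat visitors; the scrapers described
        # above show up as a huge long tail of single-request IPs.
        from collections import Counter

        hits = Counter()
        with open("access.log") as log:            # hypothetical log path
            for line in log:
                hits[line.split(" ", 1)[0]] += 1   # IP is the first field

        total = sum(hits.values())
        singletons = sum(1 for n in hits.values() if n == 1)
        print(f"{len(hits)} unique IPs, {total} requests")
        print(f"{singletons / len(hits):.0%} of IPs made exactly one request")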



    There are commercial services that provide residential proxies, i.e. you get to tunnel your scraper or bot traffic through actual residential connections. (see: Bright Data, oxylabs, etc.)

    They accomplish this by providing home users with some app that promises to pay them money for use of their connection. (see: HoneyGain, peer2profit, etc.)

    Interestingly, the companies selling the tunnel service to companies and the ones paying home users to run an app are sometimes different, or at least they use different brands to cater to the two sides of the market. It also wouldn't surprise me if they sold capacity to each other.

    I suspect some of these LLM companies (or the ones they outsource data capture to) route some of their traffic through these residential proxy services. It's funny because some of these companies already have a foothold inside homes (Google Nest and Amazon Alexa devices, etc.), but for a number of reasons (e.g. legal) they would probably rather go through a third party.
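
    Mechanically, these pools are sold as ordinary proxy endpoints, so routing scraper traffic through one is a one-line change on the buyer's side. A sketch of what that might look like; the gateway hostname, port, and credentials are entirely made up, but the shape (one authenticated gateway that rotates the exit IP per request) is the common pattern:

        # Sketch: scraper traffic exiting through a rotating residential
        # proxy pool. All values below are hypothetical.
        import requests

        proxy = "http://customer-12345:secret@gw.resi-proxy.example:8000"
        resp = requests.get(
            "https://example.com/some/page",
            proxies={"http": proxy, "https": proxy},
            headers={"User-Agent": "Mozilla/5.0 ..."},  # blend in with browsers
            timeout=30,
        )
        print(resp.status_code, len(resp.text))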



    They could be local LLMs doing search, some SETI@home-style distributed work, or something else.

    I host my blog at Mataroa, and the stats show 2K hits for some pages on some days. I don't think my small blog and writings are that popular. Asking ChatGPT about these subjects with my name results in some hilarious hallucinations, pointing to poor training due to lack of data.

    IOW, my blog is also being scraped and used to train LLMs. Not nice. I really share Drew's sentiment. I never asked for or consented to this.



    It sucks to say, but maybe the best solution is to pay Cloudflare or another fancy web security/CDN company to get rid of this problem for you?

    Which is extremely dumb, but when the alternatives are “10x capacity just for stupid bots” or “hire a guy whose job it is to block LLMs”… maybe that’s cheapest? Yes, it sucks for the open web, but if it’s your livelihood I probably would consider it.

    (Either that or make following robots.txt a legal requirement, but that feels also like stifling hobbyists that just want to scrape a page)
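
    For reference, the opt-out in question is just a robots.txt file; OpenAI, Anthropic, Common Crawl, and Google do publish user-agent tokens for their documented crawlers, though the stealth scrapers described elsewhere in this thread ignore it entirely:

        # robots.txt — refuse the documented LLM training crawlers
        User-agent: GPTBot
        Disallow: /

        User-agent: ClaudeBot
        Disallow: /

        User-agent: CCBot
        Disallow: /

        User-agent: Google-Extended
        Disallow: /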



    He's just right


    I really wonder how to best deal with this issue, almost seems like all web traffic needs to be behind CDNs which is horrible for the open web.


    The internet is evolving to a state where, if you don't have an n-level-deep stack, you're cooked.

    Which is ugly on so many levels.



    I had to take my personal Gitea instance offline, as it was regularly spammed, crashing Caddy in the process.


    Just like with crypto, AI culture is fundamentally tainted (see this very thread, where people are defending this shit). Legislation hasn't caught up, and doing what's essentially a crime (functionally this is just a DDoS) is being sold as a cool thing to do by the people who don't foot the bill. If you ever ask yourself why everybody else hates you and your ilk, this attitude is why. The thing I don't understand with AI, though, is that without the host, your parasite is nothing. Shouldn't it be in your best interest not to kill it?


    What's surprising to me is that established companies (OpenAI, Anthropic, etc) would resort to such aggressive scanning to extract a profit. Even faking the user agent and using residential IPs is really scummy.


    They're doing "something awesome", so why wouldn't they (in their own words), I ask. What they do honestly boils my blood.

    Assuming that established companies are automatically ethical and just is not correct. Meta used "a laptop and a client tuned to seed as minimally as possible" to torrent 81TB of copyrighted material to train their models.

    For every picture-perfect and shiny facade, there's an engineer sitting on the floor in a corner, hacking things together to prop up that facade.



    The internet you knew and mastered no longer exists.


    All things unregulated eventually turn into a Mogadishu-style ecosystem where parasitism is ammunition. Open source projects are the plants of this brave new disgusting world, to be grazed upon by the animals that suppressed the plants' protection system, aka the state.


    Was with Drew until the solution:

    >Please stop legitimizing LLMs or AI image generators or GitHub Copilot or any of this garbage. I am begging you to stop using them, stop talking about them, stop making new ones, just stop.

    Anyone with any inkling of the current industry understands this is Old-Man-Yells-At-Cloud territory. Why even offer this as a solution, I'm unsure.

    Here are potential solutions: 1) Cloudflare. 2) Your own DDoS-protecting reverse proxy using captchas etc. 3) Poisoning scraping results (with syntax errors in code, lol) so they ditch you as a good source of data.
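
    Option 3 can be as small as a response filter that corrupts code samples served to suspected scrapers. A toy sketch, assuming a Flask app; the bot heuristic is a placeholder, since telling these scrapers apart from real users is the hard part, as quoted upthread:

        # Toy sketch of option 3: serve subtly broken code to suspected
        # bots so the scraped copy is worthless as training data.
        import random
        import re

        from flask import Flask, request

        app = Flask(__name__)

        def is_suspected_bot(req) -> bool:
            # Placeholder heuristic; real detection is the hard part.
            return "python-requests" in req.headers.get("User-Agent", "")

        def poison(html: str) -> str:
            # Randomly strip colons inside <code> blocks, e.g. turning
            # "def f(x):" into the syntax error "def f(x)".
            def mangle(m):
                code = m.group(1)
                if random.random() < 0.5:
                    code = code.replace(":", "")
                return "<code>" + code + "</code>"
            return re.sub(r"<code>(.*?)</code>", mangle, html, flags=re.S)

        @app.route("/post/<slug>")
        def post(slug):
            html = f"<h1>{slug}</h1><code>def f(x): return x + 1</code>"
            return poison(html) if is_suspected_bot(request) else html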



    Now ask me how I know that your preferred solution to not being stabbed in the chest while walking on the street is to just wear a stab vest wherever you go.


    With your offered solutions it looks like you missed the whole point of the article.


    I had a friend that said he'd never get a mobile phone, and he did hold out until maybe 2010. He eventually realized the world had changed.


    > I had a friend that said he'd never get a mobile phone, and he did hold out...

    Up until this point you had a good start to a very inspiring story. ;)



    devhc? :D


    Welcome to 2025...


    I work for a hosting company, and I know what this is like. And while I completely respect that you don't want to give out your resources for free, a properly programmed and/or cached website wouldn't be brought down by crawlers, no matter how aggressive. Crawlers are hitting our clients' sites all the same, but you only hear problems from those who have piece-of-shit websites that take seconds to generate.
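
    For what it's worth, the caching this comment has in mind can be as little as a micro-cache in front of the app server, so a burst of identical requests hits the cache instead of the backend. A sketch in nginx terms, with the paths and upstream hypothetical:

        # Hypothetical nginx micro-cache: even a 10-second TTL absorbs
        # crawler storms, since repeated requests for the same page are
        # served from cache rather than regenerated.
        proxy_cache_path /var/cache/nginx keys_zone=microcache:10m max_size=1g;

        server {
            listen 80;
            location / {
                proxy_cache microcache;
                proxy_cache_valid 200 10s;
                proxy_cache_use_stale updating error timeout;
                proxy_pass http://127.0.0.1:8080;
            }
        }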


    git blame is always expensive to compute, and precomputing (or caching) it for every revision of every file is going to consume a lot of storage.


    Precomputing git blame should take the same order of magnitude of storage as the repository itself. Same number of lines for each revision, same number of lines changed in every commit.
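
    That claim is easy to test back-of-the-envelope: precompute blame for every file at one revision and see how well it compresses. A sketch, assuming it runs at the root of a git checkout:

        # Sketch: precompute `git blame --porcelain` for every tracked
        # file at HEAD and measure raw vs. compressed size. Blame output
        # repeats one commit id per line, so it compresses roughly like
        # the file contents themselves.
        import subprocess
        import zlib

        files = subprocess.run(
            ["git", "ls-files"], capture_output=True, text=True, check=True
        ).stdout.splitlines()

        raw = packed = 0
        for path in files:
            blame = subprocess.run(
                ["git", "blame", "--porcelain", "HEAD", "--", path],
                capture_output=True,
            ).stdout
            raw += len(blame)
            packed += len(zlib.compress(blame))

        print(f"{len(files)} files, {raw} bytes of blame, {packed} compressed")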