(Comments)

原始链接: https://news.ycombinator.com/item?id=43603453

A Hacker News thread discusses the view that recent AI model development, particularly for large language models (LLMs), shows little meaningful progress. The original article argues that reported improvements may not translate into real-world value. Some users agree, saying that for certain tasks LLMs are only slightly better than older technology such as ELIZA. Others report improvements in specific areas, notably models like Gemini 2.5 for coding assistance, while noting that limitations remain. The discussion touches on high compute costs and a possible trend toward commoditization as a direction for future development, with quantized models that deliver similar performance at much lower cost cited as evidence. Commenters also question the accuracy of benchmarks and models' ability to understand complex codebases and security models. There is debate over whether LLMs have genuinely improved or merely "sound smart." One user says they are more productive with current tools, and others note that LLMs speed up routine tasks such as writing SQL queries.

Related Articles

Original Article
Recent AI model progress feels mostly like bullshit (lesswrong.com)
79 points by paulpauper 1 hour ago | 30 comments

For three years now, my experience with LLMs has been "mostly useless, prefer ELIZA".

Which is software written in 1966, but the web version is a little newer. It does occasional psychotherapy assistance/brainstorming just as well, and I can more easily tell when I've stepped out of its known range into extrapolation.



I hope it's true. Even if LLM development stopped now, we would still keep finding new uses for them for at least the next ten years. The technology is evolving way faster than we can meaningfully absorb it, and I am genuinely frightened by the consequences. So I hope we're hitting some point of diminishing returns, although I don't believe it one bit.


I used Gemini 2.5 this weekend with aider and it was frighteningly good.

It probably depends a lot on what you are using them for, and in general, I think it's still too early to say exactly where LLMs will lead us.



I think overall quality with Gemini 2.5 is not much better than Gemini 2 in my experience. Gemini 2 was already really good, but just like Claude 3.7, Gemini 2.5 goes some steps forward and some steps backwards. It sometimes generates some really verbose code even when you tell it to be succinct. I am pretty confident that if you evaluate 2.5 for a bit longer you'll come to the same conclusion eventually.


My personal experience is right in line with the author's.

Also:

> I think what's going on is that large language models are trained to "sound smart" in a live conversation with users, and so they prefer to highlight possible problems instead of confirming that the code looks fine, just like human beings do when they want to sound smart.

I immediately thought: That's because in most situations this is the purpose of language, at least partially, and LLMs are trained on language.



My experience as someone who uses LLMs and (sometimes) a coding-assist plugin, but is somewhat bearish on AI, is that GPT/Claude and friends have gotten worse in the last 12 months or so, and local LLMs have gone from useless to borderline functional, but still not really usable day to day.

Personally, I think the models are “good enough” that we now need to start seeing the improvements in tooling and applications that come with them. I think MCP is a good step in the right direction, but I’m sceptical of the whole thing (and have been since the beginning, despite being a user of the tech).



I'd say most of the recent AI model progress has been on price.

A 4-bit quant of QwQ-32B is surprisingly close to Claude 3.5 in coding performance. But it's small enough to run on a consumer GPU, which means deployment price is now down to $0.10 per hour. (from $12+ for models requiring 8x H100)
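As a rough illustration of the setup being described (not the commenter's actual stack), here is a minimal sketch of running a 4-bit GGUF quant of QwQ-32B locally with llama-cpp-python; the model filename, context size, and GPU offload settings are assumptions.

    # Minimal sketch, not a definitive deployment recipe: run a 4-bit GGUF
    # quant of QwQ-32B on a single consumer GPU with llama-cpp-python.
    # The file name and parameters below are illustrative assumptions.
    from llama_cpp import Llama

    llm = Llama(
        model_path="qwq-32b-q4_k_m.gguf",  # hypothetical path to a 4-bit quant
        n_ctx=8192,                        # context window; size it to fit VRAM
        n_gpu_layers=-1,                   # offload all layers to the GPU
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user",
                   "content": "Write a function that merges two sorted lists."}],
        max_tokens=512,
    )
    print(out["choices"][0]["message"]["content"])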



Have you compared it with 8-bit QwQ-17B?

In my evals, 8-bit quantized smaller Qwen models were better, but again, evaluating is hard.



Yeah, I'm thinking of this from a Wardley map standpoint.

What innovation opens up when AI gets sufficiently commoditized?



One thing I’ve seen is large enterprises extracting money from consumers by putting administrative burden on them.

For example, you can see this in health insurance reimbursements and wireless carriers plan changes. (ie, Verizon’s shift from Do More, etc to what they have now)

Companies basically set up circumstances where consumers lose small amounts of money, on a recurring basis or sporadically enough, that people will just pay rather than face a maze of calls, website navigation, and time suck to recover funds that are due to them or that shouldn't have been taken in the first place.

I'm hopeful well-commoditized AI will give consumers a fighting chance against this and other kinds of disenfranchisement that seem to be increasingly normalized by companies whose consultants do nothing but optimize for their own financial position.



Brute force, brute force everything at least for the domains you can have automatic verification in.
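A minimal sketch of what "brute force plus automatic verification" could look like in practice: sample many candidates and keep the first one that passes a checker. Both generate_candidate and verify below are placeholders for illustration, not real APIs.

    # Illustrative sketch only: brute-force generation with automatic verification.
    # generate_candidate() stands in for an LLM sampled at nonzero temperature;
    # verify() stands in for a domain-specific check (unit tests, a proof
    # checker, an output validator, ...).
    from typing import Callable, Optional

    def brute_force(
        prompt: str,
        generate_candidate: Callable[[str], str],
        verify: Callable[[str], bool],
        attempts: int = 100,
    ) -> Optional[str]:
        for _ in range(attempts):
            candidate = generate_candidate(prompt)
            if verify(candidate):
                return candidate  # first candidate that passes the checker
        return None  # budget exhausted without a verified solution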


> ...whatever gains these companies are reporting to the public, they are not reflective of economic usefulness or generality.

I'm not surprised, because I don't expect pattern-matching systems to grow into something more general and useful. I think LLMs are essentially running into the same limitations that the "expert systems" of the 1980s ran into.



> [T]here are ~basically~ no public benchmarks for security research... nothing that gets at the hard parts of application pentesting for LLMs, which are 1. Navigating a real repository of code too large to put in context, 2. Inferring a target application's security model, and 3. Understanding its implementation deeply enough to learn where that security model is broken.

A few months ago I looked at essentially this problem from a different angle (generating system diagrams from a codebase). My conclusion[0] was the same as here: LLMs really struggle to understand codebases in a holistic way, especially when it comes to the codebase's strategy and purpose. They therefore struggle to produce something meaningful from it like a security assessment or a system diagram.

[0] https://www.ilograph.com/blog/posts/diagrams-ai-can-and-cann...



This was published the day before Gemini 2.5 was released. I'd be interested if they see any difference with that model. Anecdotally, that is the first model that really made me go wow and made a big difference for my productivity.


Ya, I find this hard to imagine aging well. Gemini 2.5 solved (or at least did much better on) multiple real-world systems questions I've had in the past that other models could not. Its visual reasoning also jumped significantly on charts (e.g. planning around train schedules).

Even Sonnet 3.7 was able to do refactoring work on my codebase that Sonnet 3.6 could not.

Really not seeing the "LLMs not improving" story



I doubt it. It still flails miserably like the other models on anything remotely hard, even with plenty of human coaxing. For example, try to get it to solve: https://www.janestreet.com/puzzles/hall-of-mirrors-3-index/


FWIW 2.5-exp was the only one that managed to get right a problem I asked, compared to Claude 3.7 and o1 (or any of the other free models in Cursor).

It was reverse engineering ~550MB of Hermes bytecode from a React Native app, with each function split into a separate file for grep-ability and LLM compatibility.

The others would all start off right, then quickly default to just grepping randomly for what they expected it to be, which failed quickly. 2.5 traced the function all the way back to the networking call and provided the expected response payload.

All the others hallucinated the networking response I was trying to figure out. 2.5 provided it exactly, which was enough for me to intercept the request and use the response it provided to get what I wanted to show up.
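For readers curious about the "one function per file" trick mentioned above, here is a rough sketch of splitting a huge disassembly dump into per-function files so an agent can grep and open them individually. This is not the commenter's actual tooling, and the Function<...> header regex is an assumption about the dump format.

    # Rough sketch, not the commenter's actual tooling: split a large
    # disassembly dump into one file per function for grep-ability.
    # The "Function<...>" header pattern is an assumed format.
    import re
    from pathlib import Path

    def split_by_function(dump_path: str, out_dir: str) -> int:
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        header = re.compile(r"^Function<(?P<name>[^>]*)>")
        chunks: list[tuple[str, list[str]]] = [("preamble", [])]
        with open(dump_path, encoding="utf-8", errors="replace") as f:
            for line in f:
                m = header.match(line)
                if m:
                    chunks.append((m.group("name"), []))
                chunks[-1][1].append(line)
        count = 0
        for name, lines in chunks:
            if not lines:
                continue  # skip an empty preamble
            safe = re.sub(r"[^A-Za-z0-9_.-]", "_", name) or "anonymous"
            (out / f"{count:06d}_{safe}.txt").write_text("".join(lines))
            count += 1
        return count  # number of files written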



How did you fit 550MB of bytecode into the context window? Was this using 2.5 in an agentic framework? (i.e. repeated model calls and tool usage)


I’d say the average person wouldn’t understand that problem, let alone solve it.


As someone who was wildly disappointed with the hype around Claude 3.7, Gemini 2.5 is easily the best programmer-assistant LLM available, IMO.

But it still feels more like a small incremental improvement rather than a radical change, and I still feel its limitations constantly.

Like... it gives me the sort of decent but uninspired solution I would expect it to generate, without predictably walking me through a bunch of obvious wrong turns that I have to keep correcting, as I would have had to do with earlier models.

And that's certainly not nothing and makes the experience of using it much nicer, but I'm still going to roll my eyes anytime someone suggests that LLMs are the clear path to imminently available AGI.



I think the real meaningful progress is getting ChatGPT 3.5 level quality running anywhere you want rather than AIs getting smarter at high level tasks. This capability being ubiquitous and not tied to one vendor is really what’s revolutionary.


> Since 3.5-sonnet, we have been monitoring AI model announcements, and trying pretty much every major new release that claims some sort of improvement. Unexpectedly by me, aside from a minor bump with 3.6 and an even smaller bump with 3.7, literally none of the new models we've tried have made a significant difference on either our internal benchmarks or in our developers' ability to find new bugs. This includes the new test-time OpenAI models.

This is likely a manifestation of the bitter lesson[1], specifically this part:

> The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project [like an incremental model update], massively more computation inevitably becomes available.

(Emphasis mine.)

Since the ultimate success strategy of the scruffies[2], or proponents of search-and-learning strategies in AI, is Moore's Law, short-term gains using these strategies will be minuscule. It is over at least a five-year period that their gains will be felt the most. The neats win the day in the short term, but the hare in this race will ultimately give way to the steady plod of the tortoise.

1: http://www.incompleteideas.net/IncIdeas/BitterLesson.html

2: https://en.m.wikipedia.org/wiki/Neats_and_scruffies#CITEREFM...



It’s not even approaching the asymptotic line of promises made at any achievable rate for the amount of cash being thrown at it.

Where’s the business model? Suck investors dry at the start of a financial collapse? Yeah that’s going to end well…



> where’s the business model?

For who? Nvidia sell GPUs, OpenAI and co sell proprietary models and API access, and the startups resell GPT and Claude with custom prompts. Each one is hoping that the layer above has a breakthrough that makes their current spend viable.

If they do, then you don’t want to be left behind, because _everything_ changes. It probably won’t, but it might.

That’s the business model



> So maybe there's no mystery: The AI lab companies are lying, and when they improve benchmark results it's because they have seen the answers before and are writing them down. [...then says maybe not...]

Well... they've been caught red-handed doing exactly this, again and again. Fool me once, shame on you; fool me 100 times, shame on me.



Fool me once, shame on you... If fooled, you can't get fooled again.

https://www.youtube.com/shorts/LmFN8iENTPc



Will LLMs end up like compilers? Compilers are also fundamentally important to modern industrial civilization, but they're not profit centers; they're mostly free and open-source outside a few niche areas. Knowing how to use a compiler effectively to write secure and performant software is still a valuable skill, and LLMs are a valuable tool that can help with that process, especially if the programmer is on the steep end of the learning curve. But it doesn't look like anything short of real AGI can do novel software creation without a human constantly in the loop. The same argument applies to new fundamental research, even to reviewing and analyzing new discoveries that aren't in the training corpus.

Wasn't it back in the 1980s that you had to pay $1000s for a good compiler? The entire LLM industry might just be following in the compiler's footsteps.



I'm able to get substantially more coding done than three months ago. This could be largely due to the tooling (coding agents, deep research), but the models are better too, for both coding and brainstorming. And tooling counts, to me, as progress.

Learning to harness current tools helps to harness future tools. Work on projects that will benefit from advancements, but can succeed without them.



Yes, I am a better engineer with every release. I think this is mostly empirically validated.


I'm not sure if I'm able to do more of the hard stuff, but a lot of the easy but time-consuming stuff is now easily done by LLMs.

Example: I frequently get requests for data from Customer Support that used to require 15 minutes of my time noodling around writing SQL queries. I can cut that down to less than a minute now.
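As a minimal sketch of that kind of workflow (assumptions: the OpenAI Python client, a hypothetical schema, and a human reviewing the generated SQL before it runs), it might look something like this:

    # Minimal sketch: draft a read-only SQL query from a support request.
    # The schema and model name are assumptions; the output is meant to be
    # reviewed by a human before it is ever executed.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    SCHEMA = """
    orders(id, customer_id, status, total_cents, created_at)
    customers(id, email, region, created_at)
    """  # hypothetical example schema

    def draft_query(request: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # any capable model; the name is an assumption
            messages=[
                {"role": "system",
                 "content": "Write a single read-only SQL query for this schema:\n"
                            + SCHEMA},
                {"role": "user", "content": request},
            ],
        )
        return resp.choices[0].message.content

    print(draft_query("Refund totals per region for orders created last month"))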

