I entirely agree about their utility.

HN, and the internet in general, have become an ocean of reactionary sandbagging and blather about how "useless" LLMs are. Meanwhile, in the real world, I haven't written a line of code in weeks. Just paragraphs of text that specify what I want, then guidance through and around pitfalls, in a simple iterative loop that produces useful working code.

It's entirely a learned skill; the models (and, very importantly, the tooling around them) have arrived at the baseline they needed. It's a much, much more productive world once you knuckle down and learn how to do the work.

edit: https://aider.chat/ + paid 3.5 Sonnet
This is almost entirely written by LLMs: https://github.com/williamcotton/guish

I was the driver. I told it to parse and operate on the AST, to use a plugin pattern to reduce coupling, etc. The machine did the tippy-taps for me, and at a much faster rate than I could ever dream of typing!

It's all in a Claude Project, and it can easily and reliably create new modules for bash commands because it has the full scope of the system in context and a ginormous amount of bash commands and TypeScript in the training corpus.
> There are no uncertainties or misunderstandings here. The productivity gains are real and the code produced is more robust. Not in theory, but in practice.

So, that may be a fact for you, but results are mixed when you go out wide. For example, [1] has this little nugget:

> The study identifies a disconnect between the high expectations of managers and the actual experiences of employees using AI.

> Despite 96% of C-suite executives expecting AI to boost productivity, the study reveals that 77% of employees using AI say it has added to their workload and created challenges in achieving the expected productivity gains. Not only is AI increasing the workloads of full-time employees, it's hampering productivity and contributing to employee burnout.

So not everyone is feeling the jump in productivity the same way. On this very site, there are people claiming they are blasting out highly complex applications faster than they ever could, some of them also claiming they don't even have any experience programming. Then others claim that LLMs and AI copilots just slow them down and cause much more trouble than they are worth.

It seems that, just as with programming itself, different people are getting different results.

[1] https://www.forbes.com/sites/bryanrobinson/2024/07/23/employ...
Yes, that's exactly what they are doing.

Someone literally had the balls to tell me that it was ChatGPT's fault. Due diligence and intelligence have shit the fucking bed, quite frankly.
This. I've been using Elixir for ~6 months (guided by Claude) and probably couldn't solve FizzBuzz at a whiteboard without making a syntax error. Eek.
Somebody tested people on Hacker News to evaluate programming competency.

This was part of a larger evaluation comparing the Hacker News population to people on Reddit programming subreddits. Here is a very heated discussion of the result:

https://news.ycombinator.com/item?id=33293522

It appears that Hacker News is perhaps NOT populated by the programming elite. In contrast, there are real wizards on Reddit. Surprising, I know.
You're listing plain models, so I'm assuming you're using them directly. Aider and similar agents use those models, but they don't stop at the first answer. You can add test running and a linter to the request, and it will essentially enter a loop like this:

1. What are the steps to solve (prompt)?
2. Here's a map of the repository; which files do you need?
3. What's your proposed change?
4. Here's the final change and the test run; do you think the problem has been solved? (Go back to the beginning if not.)

See the video at https://plandex.ai/ to get an idea how it works.
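A minimal sketch of what such an agent loop might look like, with hypothetical `llm`, `apply_change`, and `run_tests` helpers passed in (an illustration of the pattern, not Aider's or Plandex's actual API):

```python
from typing import Callable, Tuple

# All helper names here are hypothetical stand-ins, not a real agent's API.
def agent_loop(
    prompt: str,
    repo_map: str,
    llm: Callable[[str], str],                   # calls the chat model
    apply_change: Callable[[str], None],         # applies the proposed edit
    run_tests: Callable[[], Tuple[bool, str]],   # returns (passed, output)
    max_iters: int = 5,
) -> str:
    plan = llm(f"What are the steps to solve this?\n{prompt}")
    files = llm(f"Repository map:\n{repo_map}\nWhich files do you need?\nPlan:\n{plan}")
    for _ in range(max_iters):
        change = llm(f"Propose a concrete change to these files:\n{files}")
        apply_change(change)
        passed, output = run_tests()             # test suite plus linter
        if passed:
            return change                        # problem judged solved
        # Feed the failure back in and loop again
        prompt = f"The change failed:\n{output}\nRevise your approach."
    raise RuntimeError("no passing change within the iteration budget")
```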
I appreciate this thoughtful comment.

You are absolutely right that the trajectories, if taken linearly, might hit a dead end. I should clarify that when I mentioned "trajectories" I don't mean unpunctuated ones. I am myself not convinced that LLMs, despite their value to me today, will eventually lead to AGI as a matter of course, nor that the type of techniques used in autopilot will lead to L5 autonomy. And you're right that they are consuming a lot of our resources, which could well be better invested in a possibly better alternative.

I subscribe to Thomas Kuhn's [1] idea of scientific progress happening in "paradigms" rather than through a linear accumulation of knowledge. For instance, the path to LLMs itself was not linear, but ran through a series of new paradigms disrupting older ones. Early natural language processing was rule-based (paradigm), then it became statistical (paradigm), and then LLMs supplanted the old paradigms through transformers (paradigm), which made it possible to scale to large swaths of data. I believe there is still significant runway left for LLMs, but I expect another paradigm must supplant them to get closer to AGI. (Yann LeCun has said that he doesn't believe LLMs will lead to AGI.)

Does that mean the current exuberant investments in LLMs are misplaced? Possibly, but in Kuhn's philosophy, what typically happens is that a paradigm is milked for as much as it can be, until it reaches a crisis/anomaly where it no longer works, at which point another paradigm supplants it. At present, we are seeing how far we can push LLMs, and LLMs as they are have value even today, so it's not a bad approach per se, even though it will hit its limits at some point.

Perhaps what is more important are the second-order effects: the investments we are seeing in GPUs (essentially a bet on linear algebra) might unlock the kind of commodity computational power the next paradigm needs to disrupt the current one. I see parallels between this and investments in NASA producing many technologies we take for granted today, and military spending in California producing the technology base that enabled today's Silicon Valley.

Of course, these are just speculations, and I have no more evidence that this is happening with LLMs than anyone else. I appreciate your point, however; it is always good to step back and ask, non-cynically, whether we are headed down a good path.

[1] https://en.wikipedia.org/wiki/The_Structure_of_Scientific_Re...
Given how hard we thought programming was a year or two ago, I wouldn't bank my future on design being too hard for an LLM. They're already quite good at helping write design docs.
Can you point me to that paper? What version of the model were they using?

Have you tried again with the latest LLMs? ChatGPT-4 actually (correctly) explains what each constraint does in English; it doesn't just provide the constraint when you ask it for the formulation.

Also, I'm not sure CPLEX should be involved at all. I usually just ask for mathematical formulations, not CPLEX calling code (I don't use CPLEX). The OR literature primarily contains math formulations, and that's where LLMs can best do pattern matching to problem shape. Many of the standard formulations are in here:

https://msi-jp.com/xpress/learning/square/10-mipformref.pdf

All the LLM is doing is fitting the problem description to a combination of these formulations (and others).
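For instance, one standard building block catalogued in references like that one is the big-M trick for linking a continuous quantity to a binary indicator; a generic textbook sketch (my illustration, not quoted from the PDF):

```latex
% Big-M linking of continuous x to binary indicator y:
% "x can be positive only if y = 1", where U is a known upper bound on x.
\begin{align}
  x &\le U\, y \\
  x &\ge 0 \\
  y &\in \{0, 1\}
\end{align}
```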
I was referring to section 4 of A Survey for Solving Mixed Integer Programming via Machine Learning (2024): https://arxiv.org/pdf/2401.03244.

I've heard (but not so much observed) that there is a substantial difference between recent models, so it's possible they are better than when this was written.

Anyway, CPLEX has an associated modeling language whose syntactic sugar has the effect of obscuring the underlying MILP that it solves. I find LLMs essentially unable even to make an attempt at determining the MILP from that language.

PS: How is Xpress? Is there some reason to prefer it to Gurobi or Mosek?
The first profession AI seems on track to decimate is programming, in particular the brilliant but remote individual contributor. There is an obvious conflict of interest in this forum.
The large majority of professional writers and artists produce thankless commodity output for things like TV advertisements, games, and SEO content. These jobs should be threatened.
Once we have AI-based language servers that can track entire repositories, I think maintaining projects will actually be far easier than it is right now.
Same experience, but with TypeScript and Go. They gave me a 60-day trial (IIRC); I used it for two days, disabled it for the next 58 days, and after that removed it from the editor.
I use both for different things. WolframAlpha is great for well-defined questions with well-defined answers. LLMs are often great for anything that doesn't fall into that category.
The o1 model is really remarkable. I was able to get very significant speedups to my already highly optimized Rust code in my fast vector similarity project, all verified with careful benchmarking and validation of correctness.

Not only that, it also helped me reimagine and conceptualize a new measure of statistical dependency based on Jensen-Shannon divergence that works very well. And it came up with a super fast implementation of normalized mutual information, something I tried to include in the library originally but struggled to make fast enough when dealing with large vectors (say, 15,000 dimensions and up).

While it wasn't able to give perfect Rust code that compiled on the very first try, it was able to fix all the bugs in one more try after I pasted in all the compiler problems from VS Code. In contrast, GPT-4o usually took dozens of tries to fix all the many Rust type errors, lifetime/borrowing errors, and so on that it would inevitably introduce. And Claude 3.5 Sonnet is just plain stupid when it comes to Rust for some reason.

I really have to say, this feels like a true game changer, especially when you have really challenging tasks that you would be hard pressed to find many humans capable of helping with (at least without shelling out $500k+/year in compensation). And it's not just the performance optimization and relatively bug-free code; it's the creative problem solving and synthesis of huge amounts of core mathematical and algorithmic knowledge plus contemporary research results, combined with a strong ability to understand what you're trying to accomplish and make it happen.

Here is the diff to the code file showing the changes:

https://github.com/Dicklesworthstone/fast_vector_similarity/...
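For reference, the Jensen-Shannon divergence mentioned above is the symmetrized, smoothed form of KL divergence; a minimal NumPy sketch of the underlying math (not the project's actual Rust implementation):

```python
import numpy as np

def jensen_shannon_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """JSD(p, q) = 0.5*KL(p || m) + 0.5*KL(q || m) with m = (p + q)/2.
    Symmetric in p and q, and bounded by log(2) in natural-log units."""
    p = p / p.sum()                 # normalize to probability vectors
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a: np.ndarray, b: np.ndarray) -> float:
        mask = a > 0                # convention: 0 * log(0 / b) == 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```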
> you would be hard pressed to find many humans capable of helping with (at least without shelling out $500k+/year in compensation for).

And now we have a $number we can relate, and refer, to.
And sometimes it just bugs out and doesn't give any response? I've faced that twice now: it "thought" for 10-30s, then gave no answer, and I had to click regenerate and wait for it again.
I've seen it take over a couple of minutes, at which point I switched to Claude. And I've seen reports of it taking even longer. So it may be that you didn't wait long enough.
Completely agree with Terence Tao: this is a real advancement. I've always believed that with the right data allowing the LLM to be trained to imitate reasoning, it's possible to improve its performance. However, this is still pattern matching, and I suspect this approach may not be very effective for creating true generalization. As a result, once o1 becomes generally available, we will likely notice the persistent hallucinations and faulty reasoning, especially when the problem is sufficiently new or complex -- beyond the "reasoning programs" or "reasoning patterns" the model learned during the reinforcement learning phase.

https://www.lycee.ai/blog/openai-o1-release-agi-reasoning
The o1 model is hit and miss for me. On one hand, it has solved the NYT Connections game [0] each day I've tried it [1]. Other models, including Claude Sonnet 3.5, cannot.

But on the other hand, it misses important details and hallucinates, just like GPT-4o. And it can need a lot of hand-holding and correction to get to the right answer, so much so that sometimes you wonder if it would have been easier to just do it yourself. Only this time it's worse, because you're waiting 20-60 seconds for an answer.

I wonder if what it excels at is just the stuff I don't need it for. I'm not in classic STEM, I'm in software engineering, and o1 isn't so much better that it justifies the wait time (yet).

One area I haven't explored is using it to plan implementation or architectural changes. I feel like it might be better for this, but I need the right problems to throw at it.

[0] https://www.nytimes.com/games/connections

[1] https://chatgpt.com/share/66e40d64-6f70-8004-9fe5-83dd3653a5...
I wonder if those responses could already be influenced by the fact that the cutoff for some of the models out there was indeed 2023, and people wrote about it all over the internet.
> The experience seemed roughly on par with trying to advise a mediocre, but not completely incompetent, graduate student.

Coming from Terence Tao, that seems pretty remarkable to me?
Well, one thing is clear.

Math grad students everywhere now have a benchmark to determine if Terry Tao considers them to be mediocre or incompetent.
As a previously "mediocre, but not completely incompetent, graduate student" at a top research university (whose famous advisor was understandably frustrated with him), I consider this a huge win!
> The experience seemed roughly on par with trying to advise a mediocre, but not completely incompetent, graduate student.

And you thought you had imposter syndrome before!
Tao isn't that optimistic; his opinion on LLMs is rather conservative.

https://www.scientificamerican.com/article/ai-will-become-ma...

> If you want to prove an unsolved conjecture, one of the first things you need to do is to break it up into smaller pieces, each of which has a better chance of being proven. But you will often break up a problem into harder problems. It's very easy to transform a problem into one that's harder than into one that's simpler. And AI has not demonstrated any ability to be any better than humans in this regard.

Not sure if o1 changed his mind, though.
If you look at a lot of people's PhDs, we now teach those things to first-years. PhD students today do incredibly deep work, and the edge of science will just move further out.
Why are you saying this as if it were a bad thing? Just because software becomes better than us at something doesn't mean we can't do it for fun (see the chess community, for example).
It needs a bigger context, but the moment someone can feed an entire GitHub repo into this thing and ask it to fix bugs... I think o2 may be the beginning of the end.
Is there a list of discoveries or significant works/constructions made by people collaborating with LLMs? I mean as opposed to specific deep networks like AlphaFold or GraphCast?
Idk, I think the fact that it needs "hints" and "prodding" is a good thing, myself. Otherwise we wouldn't need humans to get those answers, would we? I want it to augment humans, not replace them.
I checked the links, and I think it's amazing, and it answers with LaTeX-formatted notation.

But I was curious, so I asked something very simple: Euclid's first postulate. I got this answer:

> Euclid's Postulate 1: "Through any two points, there is exactly one straight line."

In fact, Euclid's Postulate 1 is "To draw a straight line from any point to any point."

http://aleph0.clarku.edu/~djoyce/java/elements/bookI/bookI.h

I think the AI's answer is not correct; it may be some textbook interpretation, but I was expecting Euclid's exact wording.

Edit: Google's Gemini gives the exact wording of the postulate and then comments that this means you can draw one line between two points. I think this is better.
The original text is: Ἠιτήσθω ἀπὸ παντὸς σημείου ἐπὶ πᾶν σημεῖον εὐθεῖαν γραμμὴν ἀγαγεῖν. Roughly: let it be required that from any point to any point it is possible to draw a straight line.

Both GPT-4o and o1 roughly know the correct original text, so prompting, the model's background memory, or random chance may influence your outcome; hopefully an improved model should never give you incorrect info.

https://farside.ph.utexas.edu/Books/Euclid/Elements.pdf

Edit: in case it isn't clear, I could not reproduce this error on my end with o1-mini.
Agreed, that is what the original poster said. I didn't manage to reproduce the error on my end, but I don't know the full context, or maybe the memory on my end changes the output.
Euclid's Elements is less pervasive on the Internet than content produced for liberal arts math courses. As those courses tend to emphasize critical thinking and problem-solving over pure theory and advanced concepts, they are far more common and tend to win out over more domain-specific meanings.

Examples:

https://en.wiktionary.org/wiki/Euclidean_geometry

https://www.cerritos.edu/dford/SitePages/Math_70_F13/Postula...

Polysemy across divergent, more advanced theories has been one of my biggest challenges in probing some of my areas of interest. Funny enough, one of my pet areas of obscure interest, riddled basins, is constantly muddied not by math but by LSAT questions, specifically non-math content directed at a reading comprehension test: "September 2006 LSAT Section 1 Question 26".

IMHO, a lot of the prompt engineering you have to do with these highly domain-specific problems is about avoiding the most common responses in the corpus. LLM responses will tend to reflect common usage, not academic terminology, unless someone cares enough to change that for a specific case.
I work in a field related to operations research (OR), and ChatGPT-4o has ingested enough of the OR literature that it's able to spit out very useful Mixed Integer Programming (MIP) formulations for many "problem shapes". For instance, I can give it a logic problem like "I need to put i items in n buckets based on a score, but I want to fill each bucket sequentially" and it actually spits out a very usable math formulation (a sketch of this kind of formulation appears below); I usually just need to tweak it a bit. It also warns against weak formulations where the logic might fail, which is tremendously useful for avoiding pitfalls.

Compare this to the old way, which was to rack my brain over a weekend to figure out a water-tight formulation of the MIP optimization problem (often not straightforward for non-intuitive problems). GPT has saved me so much time in this corner of my world.
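As a hypothetical illustration of the kind of formulation this produces (my own sketch, not GPT's actual output): let x_{ij} = 1 if item i goes in bucket j, let u_j = 1 if bucket j is used, and let C be the bucket capacity. "Sequential filling" then becomes the constraint that bucket j+1 opens only once bucket j is full:

```latex
% Hypothetical sketch of a sequential-bucket assignment MIP.
\begin{align}
  \sum_{j=1}^{n} x_{ij} &= 1 \quad \forall i
    && \text{each item lands in exactly one bucket} \\
  \sum_{i} x_{ij} &\le C\, u_j \quad \forall j
    && \text{no items in an unused bucket} \\
  \sum_{i} x_{ij} &\ge C\, u_{j+1} \quad \forall j < n
    && \text{bucket } j{+}1 \text{ opens only when bucket } j \text{ is full} \\
  x_{ij},\, u_j &\in \{0, 1\}
\end{align}
```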
Yes, you probably wouldn't be able to use ChatGPT well for this purpose unless you understood MIP optimization in the first place -- and you do need to break down the problem into smaller chunks so GPT can reason in steps -- but for someone who can and does, the $20/month I pay for ChatGPT more than pays for itself.
Side note: a lot of people who complain on HN that (paid/good -- only Sonnet 3.5 and GPT-4o are in this category) LLMs are useless to them probably (1) do not know how to use LLMs in a way that maximizes their strengths; (2) have expectations that are too high based on the hype, expecting one-shot magic bullets; or (3) work in a domain where LLMs really are not good. But many of the low-effort comments seem to mostly fall into (1) and (2) -- cynicism rather than cautious optimism.
Many of us who have discovered how to exploit LLMs in their areas of strength -- and know how to check for their mistakes -- often find them providing significant leverage in our work.