I entirely agree about their utility.

HN, and the internet in general, have become an ocean of reactionary sandbagging and blather about how "useless" LLMs are. Meanwhile, in the real world, I haven't written a line of code in weeks. Just paragraphs of text that specify what I want, then guidance through and around pitfalls, in a simple iterative loop that produces useful working code.

It's entirely a learned skill; the models (and, very importantly, the tooling around them) have arrived at the baseline they needed. It's a much, much more productive world once you knuckle down and learn how to do the work.

edit: https://aider.chat/ + paid 3.5 Sonnet
This is almost entirely written by LLMs: https://github.com/williamcotton/guish

I was the driver. I told it to parse and operate on the AST, to use a plugin pattern to reduce coupling, etc. The machine did the tippy-taps for me, and at a much faster rate than I could ever dream of typing!

It's all in a Claude Project, and it can easily and reliably create new modules for bash commands because it has the full scope of the system in context and a ginormous amount of bash commands and TypeScript in the training corpus.
> There are no uncertainties or misunderstandings here. The productivity gains are real and the code produced is more robust. Not in theory, but in practice.

So, that may be a fact for you, but results are mixed when you go out wide. For example, [1] has this little nugget:

> The study identifies a disconnect between the high expectations of managers and the actual experiences of employees using AI.

> Despite 96% of C-suite executives expecting AI to boost productivity, the study reveals that 77% of employees using AI say it has added to their workload and created challenges in achieving the expected productivity gains. Not only is AI increasing the workloads of full-time employees, it's hampering productivity and contributing to employee burnout.

So not everyone is feeling the jump in productivity the same way. On this very site, there are people claiming they are blasting out highly complex applications faster than they ever could, some of them also claiming they don't even have any experience programming. Then others claim that LLMs and AI copilots just slow them down and cause much more trouble than they are worth.

It seems that, just as with programming itself, different people are getting different results.

[1] https://www.forbes.com/sites/bryanrobinson/2024/07/23/employ...
Yes, that's exactly what they are doing.

Someone literally had the balls to tell me that it was ChatGPT's fault. Due diligence and intelligence have shit the fucking bed, quite frankly.
This. I've been using Elixir for ~6 months (guided by Claude) and probably couldn't solve FizzBuzz at a whiteboard without making a syntax error. Eek.
Somebody tested people on Hacker News to evaluate programming competency.

This was part of a larger evaluation comparing the Hacker News population to people on Reddit programming subreddits. Here is a very heated discussion of the result:

https://news.ycombinator.com/item?id=33293522

It appears that Hacker News is perhaps NOT populated by the programming elite. In contrast, there are real wizards on Reddit. Surprising, I know.
You're listing plain models, so I'm assuming you're using them directly. Aider and similar agents use those models, but they don't stop at the first answer. You can add test running and a linter to the request, and it will essentially enter a loop like this:

1. What are the steps to solve (prompt)?
2. Here's a map of the repository; which files do you need?
3. What's your proposed change?
4. Here's the final change and the test run; do you think the problem has been solved? (Go back to the beginning if not.)

See the video at https://plandex.ai/ to get an idea how it works.
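A minimal sketch of what such an agent loop might look like, with hypothetical `llm`, `apply_change`, and `run_tests` helpers passed in (an illustration of the pattern, not Aider's or Plandex's actual API):

```python
from typing import Callable, Tuple

# All helper names here are hypothetical stand-ins, not a real agent's API.
def agent_loop(
    prompt: str,
    repo_map: str,
    llm: Callable[[str], str],                   # calls the chat model
    apply_change: Callable[[str], None],         # applies the proposed edit
    run_tests: Callable[[], Tuple[bool, str]],   # returns (passed, output)
    max_iters: int = 5,
) -> str:
    plan = llm(f"What are the steps to solve this?\n{prompt}")
    files = llm(f"Repository map:\n{repo_map}\nWhich files do you need?\nPlan:\n{plan}")
    for _ in range(max_iters):
        change = llm(f"Propose a concrete change to these files:\n{files}")
        apply_change(change)
        passed, output = run_tests()             # test suite plus linter
        if passed:
            return change                        # problem judged solved
        # Feed the failure back in and loop again
        prompt = f"The change failed:\n{output}\nRevise your approach."
    raise RuntimeError("no passing change within the iteration budget")
```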
I appreciate this thoughtful comment.

You are absolutely right that the trajectories, if taken linearly, might hit a dead end. I should clarify that when I mentioned "trajectories" I don't mean unpunctuated ones. I am myself not convinced that LLMs, despite their value to me today, will eventually lead to AGI as a matter of course, nor that the type of techniques used in autopilot will lead to L5 autonomy. And you're right that they are consuming a lot of our resources, which could well be better invested in a possibly better alternative.

I subscribe to Thomas Kuhn's [1] idea of scientific progress happening in "paradigms" rather than through a linear accumulation of knowledge. For instance, the path to LLMs itself was not linear, but ran through a series of new paradigms disrupting older ones. Early natural language processing was rule-based (paradigm), then it became statistical (paradigm), and then LLMs supplanted the old paradigms through transformers (paradigm), which made it possible to scale to large swaths of data. I believe there is still significant runway left for LLMs, but I expect another paradigm must supplant them to get closer to AGI. (Yann LeCun has said that he doesn't believe LLMs will lead to AGI.)

Does that mean the current exuberant investments in LLMs are misplaced? Possibly, but in Kuhn's philosophy, what typically happens is that a paradigm is milked for as much as it can be, until it reaches a crisis/anomaly where it no longer works, at which point another paradigm supplants it. At present, we are seeing how far we can push LLMs, and LLMs as they are have value even today, so it's not a bad approach per se, even though it will hit its limits at some point.

Perhaps what is more important are the second-order effects: the investments we are seeing in GPUs (essentially a bet on linear algebra) might unlock the kind of commodity computational power the next paradigm needs to disrupt the current one. I see parallels between this and investments in NASA producing many technologies we take for granted today, and military spending in California producing the technology base that enabled today's Silicon Valley.

Of course, these are just speculations, and I have no more evidence that this is happening with LLMs than anyone else. I appreciate your point, however; it is always good to step back and ask, non-cynically, whether we are headed down a good path.

[1] https://en.wikipedia.org/wiki/The_Structure_of_Scientific_Re...
Given how hard we thought programming was a year or two ago, I wouldn't bank my future on design being too hard for an LLM. They're already quite good at helping write design docs.
Can you point me to that paper? What version of the model were they using?

Have you tried again with the latest LLMs? ChatGPT-4 actually (correctly) explains what each constraint does in English; it doesn't just provide the constraint when you ask it for the formulation.

Also, I'm not sure CPLEX should be involved at all. I usually just ask for mathematical formulations, not CPLEX calling code (I don't use CPLEX). The OR literature primarily contains math formulations, and that's where LLMs can best do pattern matching to problem shape. Many of the standard formulations are in here:

https://msi-jp.com/xpress/learning/square/10-mipformref.pdf

All the LLM is doing is fitting the problem description to a combination of these formulations (and others).
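For instance, one standard building block catalogued in references like that one is the big-M trick for linking a continuous quantity to a binary indicator; a generic textbook sketch (my illustration, not quoted from the PDF):

```latex
% Big-M linking of continuous x to binary indicator y:
% "x can be positive only if y = 1", where U is a known upper bound on x.
\begin{align}
  x &\le U\, y \\
  x &\ge 0 \\
  y &\in \{0, 1\}
\end{align}
```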
I was referring to section 4 of A Survey for Solving Mixed Integer Programming via Machine Learning (2024): https://arxiv.org/pdf/2401.03244.

I've heard (but not so much observed) that there is a substantial difference between recent models, so it's possible they are better than when this was written.

Anyway, CPLEX has an associated modeling language whose syntactic sugar has the effect of obscuring the underlying MILP that it solves. I find LLMs essentially unable even to make an attempt at determining the MILP from that language.

PS: How is Xpress? Is there some reason to prefer it to Gurobi or Mosek?
The first profession AI seems on track to decimate is programming, in particular the brilliant but remote individual contributor. There is an obvious conflict of interest in this forum.
The large majority of professional writers and artists produce thankless commodity output for things like TV advertisements, games, and SEO content. These jobs should be threatened.
Once we have AI-based language servers that can track entire repositories, I think maintaining projects will actually be far easier than it is right now.
Same experience, but with TypeScript and Go. They gave me a 60-day trial (IIRC); I used it for two days, disabled it for the next 58 days, and after that removed it from the editor.
I use both for different things. WolframAlpha is great for well-defined questions with well-defined answers. LLMs are often great for anything that doesn't fall into that category.
The o1 model is really remarkable. I was able to get very significant speedups to my already highly optimized Rust code in my fast vector similarity project, all verified with careful benchmarking and validation of correctness.

Not only that, it also helped me reimagine and conceptualize a new measure of statistical dependency based on Jensen-Shannon divergence that works very well. And it came up with a super fast implementation of normalized mutual information, something I tried to include in the library originally but struggled to make fast enough when dealing with large vectors (say, 15,000 dimensions and up).

While it wasn't able to give perfect Rust code that compiled on the very first try, it was able to fix all the bugs in one more try after I pasted in all the compiler problems from VS Code. In contrast, GPT-4o usually took dozens of tries to fix all the many Rust type errors, lifetime/borrowing errors, and so on that it would inevitably introduce. And Claude 3.5 Sonnet is just plain stupid when it comes to Rust for some reason.

I really have to say, this feels like a true game changer, especially when you have really challenging tasks that you would be hard pressed to find many humans capable of helping with (at least without shelling out $500k+/year in compensation). And it's not just the performance optimization and relatively bug-free code; it's the creative problem solving and synthesis of huge amounts of core mathematical and algorithmic knowledge plus contemporary research results, combined with a strong ability to understand what you're trying to accomplish and make it happen.

Here is the diff to the code file showing the changes:

https://github.com/Dicklesworthstone/fast_vector_similarity/...
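For reference, the Jensen-Shannon divergence mentioned above is the symmetrized, smoothed form of KL divergence; a minimal NumPy sketch of the underlying math (not the project's actual Rust implementation):

```python
import numpy as np

def jensen_shannon_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """JSD(p, q) = 0.5*KL(p || m) + 0.5*KL(q || m) with m = (p + q)/2.
    Symmetric in p and q, and bounded by log(2) in natural-log units."""
    p = p / p.sum()                 # normalize to probability vectors
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a: np.ndarray, b: np.ndarray) -> float:
        mask = a > 0                # convention: 0 * log(0 / b) == 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```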
> you would be hard pressed to find many humans capable of helping with (at least without shelling out $500k+/year in compensation for).

And now we have a $number we can relate, and refer, to.
And sometimes it just bugs out and doesn't give any response? I've faced that twice now: it "thought" for 10-30s, then gave no answer, and I had to click regenerate and wait for it again.
I've seen it take over a couple of minutes, at which point I switched to Claude. And I've seen reports of it taking even longer. So it may be that you didn't wait long enough.
Completely agree with Terence Tao: this is a real advancement. I've always believed that with the right data allowing the LLM to be trained to imitate reasoning, it's possible to improve its performance. However, this is still pattern matching, and I suspect this approach may not be very effective for creating true generalization. As a result, once o1 becomes generally available, we will likely notice the persistent hallucinations and faulty reasoning, especially when the problem is sufficiently new or complex -- beyond the "reasoning programs" or "reasoning patterns" the model learned during the reinforcement learning phase.

https://www.lycee.ai/blog/openai-o1-release-agi-reasoning
The o1 model is hit and miss for me. On one hand, it has solved the NYT Connections game [0] each day I've tried it [1]. Other models, including Claude Sonnet 3.5, cannot.

But on the other hand, it misses important details and hallucinates, just like GPT-4o. And it can need a lot of hand-holding and correction to get to the right answer, so much so that sometimes you wonder if it would have been easier to just do it yourself. Only this time it's worse, because you're waiting 20-60 seconds for an answer.

I wonder if what it excels at is just the stuff I don't need it for. I'm not in classic STEM, I'm in software engineering, and o1 isn't so much better that it justifies the wait time (yet).

One area I haven't explored is using it to plan implementation or architectural changes. I feel like it might be better for this, but I need the right problems to throw at it.

[0] https://www.nytimes.com/games/connections

[1] https://chatgpt.com/share/66e40d64-6f70-8004-9fe5-83dd3653a5...
I wonder if those responses could already be influenced by the fact that the cutoff for some of the models out there was indeed 2023, and people wrote about it all over the internet.
> The experience seemed roughly on par with trying to advise a mediocre, but not completely incompetent, graduate student.

Coming from Terence Tao, that seems pretty remarkable to me?
Well, one thing is clear.

Math grad students everywhere now have a benchmark to determine if Terry Tao considers them to be mediocre or incompetent.
As a previously "mediocre, but not completely incompetent, graduate student" at a top research university (whose famous advisor was understandably frustrated with him), I consider this a huge win!
> The experience seemed roughly on par with trying to advise a mediocre, but not completely incompetent, graduate student.

And you thought you had imposter syndrome before!
Tao isn't that optimistic; his opinion on LLMs is rather conservative.

https://www.scientificamerican.com/article/ai-will-become-ma...

> If you want to prove an unsolved conjecture, one of the first things you need to do is to break it up into smaller pieces, each of which has a better chance of being proven. But you will often break up a problem into harder problems. It's very easy to transform a problem into one that's harder than into one that's simpler. And AI has not demonstrated any ability to be any better than humans in this regard.

Not sure if o1 changed his mind, though.
If you look at a lot of people's PhDs, we now teach those things to first-years. PhD students today do incredibly deep work, and the edge of science will just move further out.
Why are you saying this as if it were a bad thing? Just because software becomes better than us at something doesn't mean we can't do it for fun (see the chess community, for example).
It needs a bigger context, but the moment someone can feed an entire GitHub repo into this thing and ask it to fix bugs... I think o2 may be the beginning of the end.
Is there a list of discoveries or significant works/constructions made by people collaborating with LLMs? I mean as opposed to specific deep networks like AlphaFold or GraphCast?
Idk, I think the fact that it needs "hints" and "prodding" is a good thing, myself. Otherwise we wouldn't need humans to get those answers, would we? I want it to augment humans, not replace them.
I checked the links, and I think it's amazing, and it answers with LaTeX-formatted notation.

But I was curious, so I asked something very simple: Euclid's first postulate. I got this answer:

> Euclid's Postulate 1: "Through any two points, there is exactly one straight line."

In fact, Euclid's Postulate 1 is "To draw a straight line from any point to any point."

http://aleph0.clarku.edu/~djoyce/java/elements/bookI/bookI.h

I think the AI's answer is not correct; it may be some textbook interpretation, but I was expecting Euclid's exact wording.

Edit: Google's Gemini gives the exact wording of the postulate and then comments that this means you can draw one line between two points. I think this is better.
The original text is: Ἠιτήσθω ἀπὸ παντὸς σημείου ἐπὶ πᾶν σημεῖον εὐθεῖαν γραμμὴν ἀγαγεῖν. Roughly: let it be required that from any point to any point it is possible to draw a straight line.

Both GPT-4o and o1 roughly know the correct original text, so prompting, the model's background memory, or random chance may influence your outcome; hopefully an improved model should never give you incorrect info.

https://farside.ph.utexas.edu/Books/Euclid/Elements.pdf

Edit: in case it isn't clear, I could not reproduce this error on my end with o1-mini.
Agreed, that is what the original poster said. I didn't manage to reproduce the error on my end, but I don't know the full context, or maybe the memory on my end changes the output.
Euclid's Elements is less pervasive on the Internet than content produced for liberal arts math courses. As those courses tend to emphasize critical thinking and problem-solving over pure theory and advanced concepts, they are far more common and tend to win out over more domain-specific meanings.

Examples:

https://en.wiktionary.org/wiki/Euclidean_geometry

https://www.cerritos.edu/dford/SitePages/Math_70_F13/Postula...

Polysemy across divergent, more advanced theories has been one of my biggest challenges in probing some of my areas of interest. Funny enough, one of my pet areas of obscure interest, riddled basins, is constantly muddied not by math but by LSAT questions, specifically non-math content directed at a reading comprehension test: "September 2006 LSAT Section 1 Question 26".

IMHO, a lot of the prompt engineering you have to do with these highly domain-specific problems is about avoiding the most common responses in the corpus. LLM responses will tend to reflect common usage, not academic terminology, unless someone cares enough to change that for a specific case.
I work in a field related to operations research (OR), and ChatGPT-4o has ingested enough of the OR literature that it's able to spit out very useful Mixed Integer Programming (MIP) formulations for many "problem shapes". For instance, I can give it a logic problem like "I need to put i items in n buckets based on a score, but I want to fill each bucket sequentially" and it actually spits out a very usable math formulation (a sketch of this kind of formulation appears below); I usually just need to tweak it a bit. It also warns against weak formulations where the logic might fail, which is tremendously useful for avoiding pitfalls.

Compare this to the old way, which was to rack my brain over a weekend to figure out a water-tight formulation of the MIP optimization problem (often not straightforward for non-intuitive problems). GPT has saved me so much time in this corner of my world.
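As a hypothetical illustration of the kind of formulation this produces (my own sketch, not GPT's actual output): let x_{ij} = 1 if item i goes in bucket j, let u_j = 1 if bucket j is used, and let C be the bucket capacity. "Sequential filling" then becomes the constraint that bucket j+1 opens only once bucket j is full:

```latex
% Hypothetical sketch of a sequential-bucket assignment MIP.
\begin{align}
  \sum_{j=1}^{n} x_{ij} &= 1 \quad \forall i
    && \text{each item lands in exactly one bucket} \\
  \sum_{i} x_{ij} &\le C\, u_j \quad \forall j
    && \text{no items in an unused bucket} \\
  \sum_{i} x_{ij} &\ge C\, u_{j+1} \quad \forall j < n
    && \text{bucket } j{+}1 \text{ opens only when bucket } j \text{ is full} \\
  x_{ij},\, u_j &\in \{0, 1\}
\end{align}
```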
Yes, you probably wouldn't be able to use ChatGPT well for this purpose unless you understood MIP optimization in the first place -- and you do need to break down the problem into smaller chunks so GPT can reason in steps -- but for someone who can and does, the $20/month I pay for ChatGPT more than pays for itself.
Side note: a lot of people who complain on HN that (paid/good -- only Sonnet 3.5 and GPT-4o are in this category) LLMs are useless to them probably (1) do not know how to use LLMs in a way that maximizes their strengths; (2) have expectations that are too high based on the hype, expecting one-shot magic bullets; or (3) work in a domain where LLMs really are not good. But many of the low-effort comments seem to mostly fall into (1) and (2) -- cynicism rather than cautious optimism.
Many of us who have discovered how to exploit LLMs in their areas of strength -- and know how to check for their mistakes -- often find them providing significant leverage in our work.