This is why most of the best software is written by people building things for themselves, and most of the worst by people making software they don't use themselves.
But in that case, finding the solution is hard and you generally don't try. Instead, you try to get fairly close, and it's more difficult to verify that you've done so.
Lol. Literally.

If you have that many well-written tests, you can pass them to a constraint solver today and get your program, no LLM needed. Or even just run your tests instead of the program.
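As a toy illustration of that idea, here's a minimal sketch using the Z3 SMT solver; the "tests", the quadratic template, and all names are made up for illustration. Each input/output pair becomes a constraint, and the solver fills in the unknown program parameters directly.

```python
from z3 import Ints, Solver, sat

# Hypothetical "tests" for an unknown function f(x) = a*x*x + b*x + c.
tests = [(0, 1), (1, 2), (2, 5)]  # (input, expected output)

a, b, c = Ints("a b c")
solver = Solver()
for x, y in tests:
    solver.add(a * x * x + b * x + c == y)  # each test becomes a constraint

if solver.check() == sat:
    print(solver.model())  # e.g. a=1, b=0, c=1, i.e. f(x) = x*x + 1
```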
Code coverage exists. It shouldn't be hard at all to tune the parameters to get what you want. We have really good tools to reason about code programmatically: linters, analyzers, coverage, etc.
If you have a test that completes with the expected outcome and hits the expected code paths, you have a working test. I'd say that heuristic will get you really close with some tweaks.
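A minimal sketch of that heuristic, using only the standard library's tracing hook as a cheap stand-in for a real coverage tool; the function names and the `expected_paths` set are illustrative.

```python
import sys

def run_with_trace(test_fn):
    """Run a test while recording which functions it calls."""
    hit = set()

    def tracer(frame, event, arg):
        if event == "call":
            hit.add(frame.f_code.co_name)
        return None  # no line-level tracing needed

    sys.settrace(tracer)
    try:
        test_fn()
        passed = True
    except AssertionError:
        passed = False
    finally:
        sys.settrace(None)
    return passed, hit

def looks_like_a_working_test(test_fn, expected_paths):
    # Heuristic from the comment: expected outcome + expected code paths.
    passed, hit = run_with_trace(test_fn)
    return passed and expected_paths <= hit
```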
This reads like a proper marketing ploy. If the current incarnation of AI + coding is anything to go by, it'll take leaps just to make it barely usable (or correct).
Isn't this a good reward function for RL? Take a codebase's test suite. Rip out a function, let the LLM rewrite the function, benchmark it, and then RL it using the benchmark results.
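A rough sketch of what that reward might look like, assuming the suite can be run with pytest; the file names are hypothetical. The candidate rewrite is dropped into a scratch copy of the module and the untouched tests decide the score.

```python
import pathlib
import subprocess
import tempfile

def test_suite_reward(candidate_source: str, test_source: str) -> float:
    """Score an LLM-rewritten function by whether the existing tests pass.

    `candidate_source` is the module text with the ripped-out function
    replaced by the model's rewrite; `test_source` is the untouched tests.
    """
    with tempfile.TemporaryDirectory() as tmp:
        root = pathlib.Path(tmp)
        (root / "target_module.py").write_text(candidate_source)
        (root / "test_target.py").write_text(test_source)
        result = subprocess.run(
            ["pytest", "-q", "test_target.py"],  # run the original suite
            cwd=tmp,
            capture_output=True,
            timeout=120,
        )
    return 1.0 if result.returncode == 0 else 0.0
```

A pass fraction (or per-test partial credit) would give a denser signal than this binary version, at the cost of more plumbing.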
The output most pleasing to a human, which is both better and worse.

Better, when we can spot mistakes even if we couldn't have created the work ourselves. Think art: most of us can't draw hands, but we can spot when Stable Diffusion gets them wrong. Worse, too, because there are many things which are "common sense" and wrong, e.g. https://en.wikipedia.org/wiki/Category:Paradoxes_in_economic..., and we would collectively down-vote a perfectly accurate model of reality for violating our beliefs.
Neither do these models. The calculations I saw claiming some absurdly high energy or water use seemed like an absolute joke. Par for the course for a journalist at this point.
> if you include the desired score in your prompt, the model will now strive to produce an answer that is consistent with that score

But you need a model to generate the score from the answer, and then fine-tune another model to generate the answer conditioned on the score. The first time the score is at the end; the second time it's at the beginning. That's how Decision Transformer works too: it constructs a sequence of (reward, state, action) where the reward conditions the next action. https://arxiv.org/pdf/2106.01345

By the same logic you could generate tags, including style, author, venue, and date. Some would be extracted from the source document, the others produced with classifiers. Then you flip the order and fine-tune a model that takes the tags before the answer. Then you have an LLM you can condition on author and style.
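A minimal sketch of that flip; the tag format and the `score_model` callable are made up for illustration. Each answer is scored by a separate model, then the tag is moved in front so the fine-tuned model learns p(answer | score, prompt).

```python
def make_conditioned_example(prompt: str, answer: str, score_model) -> str:
    """Turn a (prompt, answer) pair into a fine-tuning example where the
    score appears *before* the answer, Decision-Transformer style."""
    score = score_model(prompt, answer)  # e.g. an integer rating 1-10
    return f"<score={score}>\n{prompt}\n{answer}"

# At inference time you prepend the score you *want* and let the model
# complete the answer, e.g.:
#   "<score=10>\nExplain Kepler's third law.\n"
```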
DeepMind's recent model is trained with Lean. It scored an olympiad silver medal (and was only one point away from gold).

> AlphaProof is a system that trains itself to prove mathematical statements in the formal language Lean. It couples a pre-trained language model with the AlphaZero reinforcement learning algorithm, which previously taught itself how to master the games of chess, shogi and Go.

https://deepmind.google/discover/blog/ai-solves-imo-problems...
The final conclusion, though, stands without any justification: that LLM + RL will somehow outperform people at open-domain problem solving seems quite a jump to me.
I think the point is that it's practically impossible to correctly perform RLHF in open domains, so comparisons simply can't happen.
To be fair, it says "has a real shot at" reaching AlphaGo level. AlphaGo clearly beat humans at Go, so thinking that if you could replicate that, it would have a shot doesn't seem crazy to me.
The SPAG paper is an interesting example of true reinforcement learning using language models that improves their performance on a number of hard reasoning benchmarks: https://arxiv.org/abs/2404.10642

The part that is missing from Karpathy's rant is "at scale" (the researchers only ran 3 iterations of the algorithm on small language models) and "in open domains" (I could be wrong about this, but IIRC they ran their games on a small number of common English words). But adversarial language games seem promising, at least.
> An LLM, could not, for example, definitively come up with the three laws of planetary motion like Kepler did (he looked at the data)

You could use symbolic regression instead and have the LLM write the code. Under the hood it would use a genetic programming library, something like SymbolicRegressor. Found a reference:

> AI-Descartes, an AI scientist developed by researchers at IBM Research, Samsung AI, and the University of Maryland, Baltimore County, has reproduced key parts of Nobel Prize-winning work, including Langmuir’s gas behavior equations and Kepler’s third law of planetary motion. Supported by the Defense Advanced Research Projects Agency (DARPA), the AI system utilizes symbolic regression to find equations fitting data, and its most distinctive feature is its logical reasoning ability. This enables AI-Descartes to determine which equations best fit with background scientific theory. The system is particularly effective with noisy, real-world data and small data sets. The team is working on creating new datasets and training computers to read scientific papers and construct background theories to refine and expand the system’s capabilities.

https://scitechdaily.com/ai-descartes-a-scientific-renaissan...
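A rough sketch of what that generated code might look like, assuming gplearn's SymbolicRegressor; the synthetic data and parameters are illustrative. The target is Kepler's third law, T² ∝ a³, recovered from noisy orbit data.

```python
import numpy as np
from gplearn.genetic import SymbolicRegressor

# Synthetic "Kepler" data: semi-major axis a (AU) -> orbital period T (years),
# T = a**1.5 plus a little noise.
rng = np.random.default_rng(0)
a = rng.uniform(0.3, 40.0, size=200).reshape(-1, 1)
T = a[:, 0] ** 1.5 * (1 + rng.normal(0, 0.01, size=200))

est = SymbolicRegressor(
    population_size=2000,
    generations=20,
    function_set=("mul", "sqrt"),   # enough to express sqrt(a*a*a)
    parsimony_coefficient=0.001,
    random_state=0,
)
est.fit(a, T)
print(est._program)  # expect something equivalent to sqrt(mul(mul(X0, X0), X0))
```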
Perhaps not entirely open domain, but I have high hopes for “real RL” in coding, where you can get a reward signal from compile/runtime errors and tests.
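A toy version of such a signal; the function name, the weights, and the `solve(x)` convention are all made up. Syntax errors score lowest, runtime errors a bit higher, and the rest scales with how many test cases pass.

```python
def shaped_reward(code: str, io_cases) -> float:
    """Score generated code: 0.0 if it doesn't parse, 0.25 if it parses
    but crashes, and 0.5-1.0 depending on how many (input, output) test
    cases the generated `solve(x)` function gets right."""
    try:
        compiled = compile(code, "<candidate>", "exec")
    except SyntaxError:
        return 0.0
    namespace = {}
    try:
        exec(compiled, namespace)
        solve = namespace["solve"]
        passed = sum(solve(x) == y for x, y in io_cases)
    except Exception:
        return 0.25
    return 0.5 + 0.5 * passed / len(io_cases)

# Example: reward a candidate that squares its input.
print(shaped_reward("def solve(x):\n    return x * x", [(2, 4), (3, 9)]))  # 1.0
```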
Interesting, has anyone been doing this? I.e. training/fine-tuning an LLM against an actual coding environment, as opposed to just tacking that on later as a separate "agentic" construct?
I agree that it's a very difficult problem. I'd like to mention AlphaDev [0], an RL algorithm that builds other algorithms; there they combined a measure of correctness and a measure of algorithm speed (latency) to get the reward. But the algorithms they built were super small (e.g., sorting just three numbers), so they could measure correctness over all input combinations. It is still unclear how to scale this to larger problems.

[0] https://deepmind.google/discover/blog/alphadev-discovers-fas...
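A toy illustration of that reward shape, not AlphaDev itself: the candidate, the weighting, and the wall-clock timing proxy are made up. For sorting three numbers the input space is small enough to check every case, and a latency term is subtracted from the correctness term.

```python
import itertools
import timeit

def candidate_sort3(a, b, c):
    # Stand-in for a generated program: a fixed compare-and-swap network.
    if a > b: a, b = b, a
    if b > c: b, c = c, b
    if a > b: a, b = b, a
    return a, b, c

def reward(fn, latency_weight=0.1):
    # Correctness: the input space is tiny, so check every ordering.
    cases = list(itertools.permutations((1, 2, 3)))
    correct = sum(fn(*case) == (1, 2, 3) for case in cases) / len(cases)
    # Latency: wall-clock time as a rough proxy for the CPU latency
    # AlphaDev actually measures.
    t = timeit.timeit(lambda: [fn(*case) for case in cases], number=10_000)
    return correct - latency_weight * t

print(reward(candidate_sort3))
```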
My takeaway is that it's difficult to make a "generic enough" evaluation that encompasses all the things we use an LLM for, e.g. code, summaries, jokes. That would be something of a free lunch.
> I wouldn't say it sucks. You just need to keep training it for as long as needed.

As that timeline can approach infinity, just adding extra training may not actually be a sufficient compromise.
Been shouting this for over a year now. We're training AI to be convincing, not to be actually helpful. We're sampling the wrong distributions.
Depends on who you ask.

Advertising and propaganda are not necessarily helpful for consumers; they just need to be convincing in order to be helpful for producers.
If they've convinced me of their helpfulness, and their output is actually helpful in solving my problems... well, if it walks like a duck and quacks like a duck, and all that.
While I agree with Karpathy, and I also had a "wut? They call this RL?" reaction when RLHF was presented as a method for training ChatGPT, I'm a bit surprised by the insight he makes, because the same method and insight appeared in "Learning from Human Preferences" [1] from none other than OpenAI, published in 2017.

Sometimes judging a "good enough" policy is orders of magnitude easier than formulating an exact reward function, but this is very much domain and scope dependent. Trying to estimate a reward function in those situations can often be counterproductive, because the reward might even screw up your search direction. This observation was also made by the authors of the book "The Myth of the Objective" [2] with their Picbreeder example (the authors, as it happens, also work for OpenAI).

When you have a well-defined reward function with no bad local optima and no cost to rolling out faulty policies, RL works remarkably well (Alex Irpan described this well in his widely cited blog [3]). The problem is that these are pretty hard requirements for most problems that interact with the real world (as opposed to the internet, the artificial world). It's either the local optima that get in the way (LLMs and text) or the rollout cost (running a Go game a billion times just to beat humans is currently not a feasible requirement for a lot of real-world applications).

Tangentially, this is also why I suspect LLMs for planning (and understanding the world) in the real world have been lacking. Robot Transformer and SayCan approaches are cool, but if you look past the fancy demos the performance is indeed lackluster. It will be interesting to see how these observations and Karpathy's observations are tested against the current humanoid robot hype, which imo is partially fueled by a misunderstanding of LLMs' capacity, including what Karpathy mentioned. (Shameless plug: [4])

[1] https://openai.com/index/learning-from-human-preferences/
[2] https://www.lesswrong.com/posts/pi4owuC7Rdab7uWWR/book-revie...
I expect language models to also get crazy good at mathematical theorem proving. The search space is huge but theorem verification software will provide 100% accurate feedback that makes real reinforcement learning possible. It's the combination of vibes (how to approach the proof) and formal verification that works.
Formal verification of program correctness never got traction because it's so tedious and most of the time approximately correct is good enough. But with LLMs in the mix the equation changes. Having LLMs generate annotations that an engine can use to prove correctness might be the missing puzzle piece.
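For the theorem-proving loop, the key property is that the checker's verdict is exact: a candidate proof either elaborates or it doesn't. A toy Lean 4 example of the kind of statement and proof term involved; the theorem here is illustrative, not taken from AlphaProof or any specific benchmark.

```lean
-- If Lean accepts this file, the proof is correct; if the model had
-- emitted a wrong proof term, elaboration would fail.  That binary,
-- trustworthy signal is what makes "real RL" on proofs possible.
theorem add_comm' (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```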