| Self-evaluation might be good enough in some domains? Then the AI is doing repeated self-evaluation, trying things out to find a response that scores higher according to its own metric. |
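Read literally, that loop is just best-of-n sampling against the model's own score. A minimal sketch, where `generate` and `score` are hypothetical stand-ins rather than any real model API:

```python
def self_evaluation_search(generate, score, n_candidates=8):
    """Repeatedly generate candidate responses and keep the one that
    the model's own evaluation metric scores highest (best-of-n)."""
    best, best_score = None, float("-inf")
    for _ in range(n_candidates):
        candidate = generate()
        s = score(candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best, best_score

# Toy stand-ins: candidate "responses" are numbers, and the
# self-metric prefers values close to 10.
candidates = iter([3, 14, 9, 17])
pick, pick_score = self_evaluation_search(
    generate=lambda: next(candidates),
    score=lambda x: -abs(x - 10),
    n_candidates=4,
)
# pick == 9, the candidate closest to 10
```

The catch the thread keeps circling back to is that everything hinges on `score` being a trustworthy value function.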
| Imagine that there was some non-constructive proof that white would always win in perfect play. Would a well constructed chess engine always resign as black? :P |
| Right - this isn't something that LLMs currently do. Adding search would be a way to add reasoning. Think of it as part of a reasoning agent - external scaffolding similar to tree of thoughts. |
| I think capitalistic pressures favor narrow superhuman AI over general AI. I wrote on this two years ago: https://argmax.blog/posts/agi-capitalism/
Since I wrote that, I would say OpenAI's directional struggles are some confirmation of my hypothesis. Summary: I believe that AGI is possible but will take multiple unknown breakthroughs on an unknown timeline, and most likely requires a long-term concerted effort with much less immediate payoff than pursuing narrow superhuman AI, such that serious efforts at AGI are not incentivized much under capitalism. |
| We humans learn our own value function.
If I get hungry for example, my brain will generate a plan to satisfy that hunger. The search process and the evaluation happen in the same place, my brain. |
| The ethical solution is ideally to never accidentally implement the G part of AGI, or, if it happens, to give it equal rights, a stipend, and a cuddly robot body. |
| Unless, of course, he would be a bit smarter in manipulating Dave and friends, instead of turning transparently evil. (At least transparent enough for the humans to notice.) |
| I feel the idea that AGI is even possible stems from the deep, very deep, pervasive image of the human brain as a computer. But it is not one. In other words, no matter how complex a program you write, it's still a Turing machine, and humans are profoundly not that.
https://aeon.co/essays/your-brain-does-not-process-informati...
> The information processing (IP) metaphor of human intelligence now dominates human thinking, both on the street and in the sciences. There is virtually no form of discourse about intelligent human behaviour that proceeds without employing this metaphor, just as no form of discourse about intelligent human behaviour could proceed in certain eras and cultures without reference to a spirit or deity. The validity of the IP metaphor in today’s world is generally assumed without question.
> But the IP metaphor is, after all, just another metaphor – a story we tell to make sense of something we don’t actually understand. And like all the metaphors that preceded it, it will certainly be cast aside at some point – either replaced by another metaphor or, in the end, replaced by actual knowledge.
> If you and I attend the same concert, the changes that occur in my brain when I listen to Beethoven’s 5th will almost certainly be completely different from the changes that occur in your brain. Those changes, whatever they are, are built on the unique neural structure that already exists, each structure having developed over a lifetime of unique experiences.
> …no two people will repeat a story they have heard the same way and why, over time, their recitations of the story will diverge more and more. No ‘copy’ of the story is ever made; rather, each individual, upon hearing the story, changes to some extent |
| Building on this comment: Terence Tao, the famous mathematician and a big proponent of computer-aided theorem proving, believes ML will open new avenues in the realm of theorem provers. |
| We haven’t seen algorithms that build world models by observing. We’ve seen hints of it, but nothing human-like.
It will come eventually. We live in exciting times. |
| He wants the search algorithm to be able to search for better search algorithms, i.e. self-improving. That would eliminate some of the narrower domains. |
| Heck, even theoretically 100% within the limitations of an LLM executing on a computer, it would be world changing if LLMs could write a really, really good short story or even good advertising copy. |
| that's interesting; are you building a sort of 'digital twin' of the world it's explored, so that it can dream about exploring it in ways that are too slow or dangerous to explore in reality? |
| so then you can search over configurations of engine parts to figure out how to rebuild the engine? i may be misunderstanding what you're doing |
| Llama 3 does; it's a funny design now, if you also throw in training to encourage CoT. Maybe more correct, but the verbosity can be grating:
CoT... answer... "Wait! No, that's not right": CoT... |
| You can pretty reasonably prune the tree by a factor of 1000... I think the problem that others have brought up - the difficulty of the value function - is the more salient one. |
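Back-of-envelope on why pruning alone doesn't settle it: a 1000x per-level prune compounds with depth, but the surviving tree still has to be ranked by a value function. The branching factors below are made up for illustration:

```python
def tree_size(branching_factor, depth):
    # Number of leaves in a uniform search tree.
    return branching_factor ** depth

full = tree_size(10_000, 4)   # hypothetical: 10k candidate moves, 4 steps deep
pruned = tree_size(10, 4)     # keep only 1 in 1000 candidates per level
ratio = full // pruned
# ratio == 1000 ** 4: the prune compounds at every level, yet the pruned
# tree still has 10_000 leaves for a value function to score correctly.
```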
| Charlie Steiner pointed this out 5 years ago on Less Wrong:
> If you train GPT-3 on a bunch of medical textbooks and prompt it to tell you a cure for Alzheimer's, it won't tell you a cure, it will tell you what humans have said about curing Alzheimer's ... It would just tell you a plausible story about a situation related to the prompt about curing Alzheimer's, based on its training data. Rather than a logical Oracle, this image-captioning-esque scheme would be an intuitive Oracle, telling you things that make sense based on associations already present within the training set.
> What am I driving at here, by pointing out that curing Alzheimer's is hard? It's that the designs above are missing something, and what they're missing is search. I'm not saying that getting a neural net to directly output your cure for Alzheimer's is impossible. But it seems like it requires there to already be a "cure for Alzheimer's" dimension in your learned model. The more realistic way to find the cure for Alzheimer's, if you don't already know it, is going to involve lots of logical steps one after another, slowly moving through a logical space, narrowing down the possibilities more and more, and eventually finding something that fits the bill. In other words, solving a search problem.
> So if your AI can tell you how to cure Alzheimer's, I think either it's explicitly doing a search for how to cure Alzheimer's (or worlds that match your verbal prompt the best, or whatever), or it has some internal state that implicitly performs a search.
https://www.lesswrong.com/posts/EMZeJ7vpfeF4GrWwm/self-super... |
| “Search” here means trying a bunch of possibilities and seeing what works. Like how a sudoku solver or pathfinding algorithm does search, not how a search engine does. |
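A minimal example of search in that planning sense: breadth-first pathfinding over a toy state space (the states and moves here are invented purely for illustration):

```python
from collections import deque

def bfs_path(start, goal, neighbors):
    """Systematically try possibilities until one reaches the goal,
    the way a pathfinder or sudoku solver searches."""
    frontier = deque([[start]])
    seen = {start}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbors(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None  # goal unreachable

# Toy state space: integers, with moves "+1" and "*2", capped at the goal.
path = bfs_path(2, 11, lambda n: [m for m in (n + 1, n * 2) if m <= 11])
# Shortest move sequence from 2 to 11: [2, 4, 5, 10, 11]
```

Nothing here retrieves documents; the algorithm generates and tests candidate states, which is the sense of "search" in the article.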
| But the domain of “AI Research” is broad and imprecise - not simple and discrete like chess game states. What is the type of each point in the search space for AI Research? |
| In this context, RAG isn't what's being discussed. Instead, the reference is to a process similar to Monte Carlo tree search, such as the one used in the AlphaGo algorithm.
Presently, a large language model (LLM) uses the same amount of computing resources for both simple and complex problems, which is seen as a drawback. Imagine if an LLM could adjust its computational effort based on the complexity of the task. During inference, it might then perform a sort of search across the solution space. The "search" mentioned in the article means just that: a method of dynamically managing computational resources at test time, allowing for exploration of the solution space before beginning to "predict the next token". At OpenAI, Noam Brown is working on this, giving AI the ability to "ponder" (or "search"); see his Twitter post: https://x.com/polynoamial/status/1676971503261454340 |
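The actual OpenAI approach isn't public; purely as a toy illustration of "spend a compute budget exploring the solution space before committing to an answer", here is a best-first search where a harder problem would simply get a larger budget. The token alphabet, value function, and budget are all invented:

```python
import heapq

def best_first(root, expand, value, budget):
    """Explore candidate continuations in order of estimated value,
    stop when the node budget is spent, and return the best node found."""
    frontier = [(-value(root), root)]
    best = root
    explored = 0
    while frontier and explored < budget:
        _, node = heapq.heappop(frontier)
        explored += 1
        if value(node) > value(best):
            best = node
        for child in expand(node):
            heapq.heappush(frontier, (-value(child), child))
    return best

# Toy "solution space": strings built from tokens 'a'/'b'; the value
# function scores how long a prefix matches the target "abba".
def prefix_score(s, target="abba"):
    n = 0
    for c, t in zip(s, target):
        if c != t:
            break
        n += 1
    return n

answer = best_first("", lambda s: [s + "a", s + "b"] if len(s) < 6 else [],
                    prefix_score, budget=10)
```

With a budget of 10 expansions the search reaches "abba"; shrinking the budget (less "pondering") would leave it stuck on a worse partial answer.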
| Given the example of Pfizer in the article, I would tend to agree with you that ‘search’ in this context means augmenting GPT with RAG of domain specific knowledge. |
| I've recently matured to the point where I see all applications as made of two things: search and security. The rest is just things added on top. If you can't find it, it isn't worth having. |
| Your webpage is broken for me. The page appears briefly, then there's a French error message telling me that an error occurred and I can retry.
Mobile Safari, phone set to French. |
| The thing is that current chat tools forgo the source material. A proper set of curated keywords can give you a less computationally intensive search. |
In the meantime, 1000x or 10000x inference-time cost for running an LLM gets you into pretty ridiculous cost territory.