(comments)

原始链接: https://news.ycombinator.com/item?id=39955725

It is currently argued that large language models (LLMs) are limited as a path to artificial general intelligence (AGI). Conversely, in casual conversation most people behave much like LLMs, exhibiting patterns and responses a stranger could predict. LLMs function like mathematical algorithms, taking inputs and producing outputs, without conscious experience or emotion. Despite their progress, their intelligence remains uncertain, owing to the distinction between simulation and emulation and the difficulty of defining intelligence. Moreover, humans often oversimplify complex systems, including LLMs, as a defense mechanism or out of a desire for clarity. Such beliefs often stem from the conviction that LLMs lack genuine understanding and require constant guidance. Yet even sophisticated humans reason inconsistently, suggesting that intelligence is not merely deterministic or predictable. Ultimately, the ongoing debate revolves around how intelligence is defined and what we expect of it, rather than any inherent limitation of LLMs.

Related Articles

Original Article


I'm not sure people in these comments are reading this paper correctly.

This seems to essentially disprove the whole idea of multi-agent setups like Chain-of-thought and LLM-Debate.

Because this paper introduces their alternative method which simply runs the same query multiple times on the same LLM, without any context shared across queries. And then they run a similarity algorithm on the answers and pick the most common answer. (Which makes sense to me. If an LLM is giving you a mixture of "hallucinations" and correct answers, the correct answers will be similar and the hallucinations will hopefully be chaotic.)
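A minimal sketch of what that sampling-and-voting idea looks like in code (my own illustration, not the authors' implementation; `query_llm` is a hypothetical stand-in for whatever completion API you use, and exact-match counting stands in for the paper's similarity-based clustering):

    # Run the same prompt several times with no shared context, then vote.
    from collections import Counter

    def sample_and_vote(query_llm, prompt, n_samples=10):
        # Independent samples: hallucinations tend to scatter,
        # while correct answers tend to agree with each other.
        answers = [query_llm(prompt, temperature=0.7) for _ in range(n_samples)]
        return Counter(answers).most_common(1)[0][0]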

And this simple algorithm performs just as well as (and sometimes better than) all the other multi-agent algorithms.

This suggests that the other multi-agent schemes with their clever prompts aren't really doing anything special; their improved results come mostly from the fact that the LLM is run multiple times, not from the fact that the prompt asks the LLM to pick the best answer.



>> Because this paper introduces their alternative method which simply runs the same query multiple times on the same LLM, without any context shared across queries. And then they run a similarity algorithm on the answers and pick the most common answer.

https://en.wikipedia.org/wiki/Lorenz_system

Years ago weather simulations started tweaking input params and running their models over and over. Discarding outliers, taking averages. It works pretty well.

Because LLMs mostly have random seeds (aka temperature), feeding them the same input and averaging the output is going to get you a better guess.

Lorenz also gives some clues (if not an outright explanation) as to why the "hallucination" problem is likely unsolvable.

If you buy into this line of thinking then it quickly becomes apparent that LLMs are more or less a dead end when it comes to AGI. Simulating isn't emulating... an LLM is as likely to become intelligent as a forecast is to control the weather.



> it quickly becomes apparent that LLMs are more or less a dead end when it comes to AGI.

On the contrary, sit and listen in a college cafeteria, and it quickly becomes apparent most conversation participants are LLMs.*

> Simulating isn't emulating...

These are not synonyms, true.

> an LLM is as likely to become intelligent as a forecast is to control the weather.

I don't see uncertainty of intelligence as a property of an LLM as being equivalent to certainty of weather control as an effect of a forecast.

Among other things, whether weather was controlled would tend to be agreed by all observers, while it's often unclear if intelligence is being observed in these threads. :-)

---

* While my last line was a joke, humans in LLM mode was not. We can drive on autopilot, and get where we need to go while not being able to remember how we got there. We definitely converse on autopilot, indistinguishably from LLMs talking to each other, after an opening line every word of every sentence in the entire exchange perfectly predictable to a stranger. Are the speakers intelligent? What about the stranger who knows what they will say next? To say LLMs are not intelligent is easier if we agree humans spend a good deal of time being unintelligent.



LLMs were specifically trained to emulate human interaction patterns. Of course we sound like them at times. It's the things we can do that they can't that are relevant.

If I study Einstein and learn to do a really good impression, the statement "Einstein often sounds like karmacondon" will be true. That does not make me Einstein.



> If I study Einstein and learn to do a really good impression, the statement "Einstein often sounds like karmacondon" will be true.

Wrong alt, hooande ;)



>>> I don't see uncertainty of intelligence as a property of an LLM as being equivalent to certainty of weather control as an effect of a forecast.

GTA 5 is a simulation. Do you expect to be arrested outside your front door for the car you stole in game?

Weather forecasting is a simulation: it tells you what the weather will look like in the next few days. It gets better as we get more sensors, collect more data and build more accurate models based on those two factors. It will never make the leap to being the weather.

Language forecasting (because this is what an LLM is) is a simulation. It tells you what the next token (word) will be based on what came before it. It gets better as we collect more data and hone and refine these models. It will never make the leap to intelligence.

>> To say LLMs are not intelligent is easier if we agree humans spend a good deal of time being unintelligent.

To say that LLMs are intelligent means that language is a requirement for intelligence. That's some fairly magical thinking... but any sufficiently advanced technology...



Intelligence breaks the pattern here. A simulated intelligence is intelligent, just as simulated math is math and simulated computers are computers. The point of contention shouldn't be whether LLMs are intelligences or simulated intelligences, but whether they're simulating something else.


Right. This is Searle's "a simulated plane won't get you to Japan" argument.

That's true. But a simulated calculator is perfectly effective for doing your taxes.



Like Searle’s Chinese Room argument [0]?

I think a challenge with the simulated-is-real math/calculator argument is that the simulation operates syntactically through derivation without meaning.

E.g. a simulation of ZF set theory cannot tell you the truth value of the Axiom of Choice - because it’s independent of the ZF axioms (it is undecidable in the Gödel incompleteness sense).

But “Although originally controversial, the axiom of choice is now used without reservation by most mathematicians” [1] - I guess its truth is self-evident semantically.

So because of incompleteness, simulated math/calc will always be “missing” something.

Of course an LLM will happily say A of C is true (or not), but is it just parroting from the dataset or hallucinating?

[0]: https://plato.stanford.edu/entries/chinese-room/

[1]: https://en.m.wikipedia.org/wiki/Axiom_of_choice



Not sure if it counts but there is a police chase video online some place with a guy on drugs who claims he thought he was playing GTA. The way he throws people out of their vehicle and crashes their car suggests he wasn't lying.


> Language forecasting (because this is what an LLM is) is a simulation. It tells you what the next token (word) will be based on what came before it. It gets better as we collect more data and hone and refine these models. It will never make the leap to intelligence.

Due to quantum theory and chaos theory it is impossible to simulate any system to 100%. Yet, this does not mean it is impossible to design intelligent systems which are indistinguishable from their 'real' counterparts. Maybe we are at the level where a fly can be simulated accurately enough to make a distinction moot, maybe we have enough compute to simulate a mouse. We will get to a point where we can simulate a human brain. It will be indistinguishable from intelligence. I don't think the methodology really matters. In the end everything is compute.



> To say that LLMs are intelligent means that language is a requirement for intelligence. That's some fairly magical thinking... but any sufficiently advanced technology..

When I was a kid, it was the definition of intelligence that separated humans from animals.

And there's a reason "dumb" means "mute" and independently "stupid".

It may well be an incorrect requirement. It may be a single form of intelligence out of many which happen to correlate in humans, but not in minds created by artifice.

But it does have a history.



> We definitely converse on autopilot, indistinguishably from LLMs talking to each other, after an opening line every word of every sentence in the entire exchange perfectly predictable to a stranger

Some people report speaking like this: opening their mouths and not knowing how the sentence will end.

I don't experience that, I think.

Possibly used to? I have in the past had some autonomous verbal responses, for a bit this included echoing greetings — great when it's "hello", embarrassing when it's "happy birthday".

> To say LLMs are not intelligent is easier if we agree humans spend a good deal of time being unintelligent

Kinda; System 1, system 2 — the best LLMs do better than most people's system 1, worse than most people's system 2. Bat and ball, $1.10.



> On the contrary, sit and listen in a college cafeteria, and it quickly becomes apparent most conversation participants are LLMs.*

Excuse the bluntness, but you're the CTO of a fintech company. Your analysis of people's social lives is probably about as valuable as a janitor's.



Why is it so important to you that everyone recognizes this intelligence? What is at stake in your mind here?

This impulse towards reductivism/behaviorism in order to defend the LLMs is still profoundly interesting. It always ends up feeling like the person wants to be like an LLM, not the other way around. I think people feel lost in a deep way, and this line of thought becomes deeply comforting.

Like, so many people it seems want the future and themselves to become comprehensible all at once. "Why worry so much about myself? Im just a stochastic parrot like an LLM anyway.. Attention is all I need!"

I get it, life is hard. But we need to keep the dream alive. You gotta hope for better.

All this makes the future sound so dull. Like I am gonna wake up one day and all pizza will be shitty, tasteless pizza, but everyone will tell me: "well really look at it, it has cheese, sauce, toppings... It's pizza! You can eat it."



> If you buy into this line of thinking then it quickly becomes apparent that LLMs are more or less a dead end when it comes to AGI. Simulating isn't emulating... an LLM is as likely to become intelligent as a forecast is to control the weather

Up until this point, I agree.

This puts humans on too high a pedestal: LLMs aren't magic, and we're not magic either.

(There's other reasons for me to think Transformers aren't the answer, but not this kind of reasoning).



The weather isn’t magic either. It’s produced by physical mechanisms. But everyone would probably agree that a model simulating some rough aggregate of those mechanisms isn’t “weather” itself.

On the other hand. Take that weather model and render its output into a stereoscopic 3D world with photorealistic particle systems and whatever. To someone wearing a Vision Pro or similar high-def VR headset, the model is now “the weather” in the system their senses occupy. It’s missing a lot of actual sensory cues — the rain isn’t wet, the wind won’t chill your skin, and so on. But it’s close enough for some convincing applications. A caveman with no experience with technology would undoubtedly believe himself transported into a different world with real weather.

LLMs are a bit like that now. Their simulation abilities took such a sudden leap, we’re like cavemen wearing headsets.



The only way I can model what you're trying to say, is if I assume you think "the mind" is a separate kind of substance, and not merely information processing that just happens to be implemented on biological electrochemistry in our skulls.

A (philosophical) dualist can easily say that no computation is ever intelligent. I don't think this can ever be said by a (philosophical) materialist.



Even if from a technical perspective you're right, I think people need to be careful with the "x is not special" talk. It is a put down and it's how things like human and animal rights get obliterated and how the environment gets ruined.

"Trees aren't special", "Dolphins aren't special", "Koala's suck, let's put a mine here instead", "Pigs don't have emotions or are dumb, so it's fine to factory farm" etc.



I don't get the argument. I don't think something being magic will stop humans from exploiting it. At the end of the day intelligent people are great at coming up with excuses as to why they should do something bad. "Just chop that one tree down, it's in the wrong place anyway", "Just kill that one dolphin, it's old anyway". When taken together these add up to bad outcomes we dislike. Much better to discourage / fine / ban all tree chopping and dolphin killing and let select professionals remove sick trees and dolphins.


Indeed. But I said "X is not magic", rather than "X is not special" — until we have an answer to the hard problem of consciousness (or agree which of the 40 definitions of the word "consciousness" we're using when discussing if an AI has it), we can't possibly determine if an LLM has it or not.

(My gut feeling says "LLMs are not conscious", but my gut has had a lot of false beliefs over the years as well as correct ones, so I give it a corresponding level of trust).



Fair enough then. I sort of use the terms interchangeably in this context.

When you think about it, a bird is “magic” in the sense there is a whole universe and ecosystem to give that bird the platform for existence. A real living bird isn’t just a concept.

So sometimes I wonder if we just say we’re insignificant because it’s a simpler way to think. It makes the idea of death and loss easier to bear.

If I tell myself I’m just a speck of dust and that I’m not special, it can be quite comforting.

Conceptually we understand things about how birds work but the fact there is a blob of millions or billions of cells functioning to produce a bird, which can fly, completely autonomously is quite peculiar and there is a type of magic or wonder to it all which makes me think birds are both special and magic if you think differently about existence and not just the intellectual concept of a bird.



My gut feeling is that consciousness isn’t as deep and mysterious as people think it is. It’s possible that consciousness is an inevitable result of putting a sufficiently intelligent mind into a body and, as a result, the mind can’t help but weave a story about itself that connects events together.

Similarly with other properties of intelligence and the brain that we like to think are mysterious and deep.



> we're not magic either

We pretty much are compared to present-day neural architectures. How many simulated neurons and synapses are in the largest architectures, and how do those numbers compare to humans?



Unknown for the actual largest due to secrecy; 1% for the largest public models… but also organic ones are definitely a bit different from digital ones, and the jury is still out if those differences matter and if so by how much.

The comparison would therefore be with a mid-sized rodent, horse, or raven rather than a human.

(But even that's misleading, because the LLM doesn't have to use tokens to represent "contract left supracoracoideus" and "lay egg").

Edit: also, I've not heard much suggestion that anyone knows how certain genes do things like giving humans the inherent capability to recognise and create smiles or other similar reflexes, so we don't really know how much of our brains are pre-trained by evolution; furthermore, I think organic life is more sample-efficient for learning things than any AI so far.



Tokens aren't a necessary differentiator here. There is no fundamental technical reason why tokenization is used, it just has certain practical advantages. And the distinction almost disappears when we look at multimodal transformers, which process images, audio, and video broken apart into sequences of blocks of binary data.


There's no reason for any specific tokenisation, but the Transformer always has some tokenisation.

Tokens are allowed to be blocks of pixels, for example. No reason we couldn't have a token be a specific muscle or sensory nerve.

What I'm saying is that Large Language Models don't have a body, so no nerves and muscles to have to be represented within them; conversely, organic life does have those things and thus organic brains must spend some of their complexity on those things.

This means they have the possibility to equal us for language even with no capacity for vision, walking, tying shoelaces, or playing catch.



It’s a non starter to assume that virtual “synapses and neurons” behave like ours do. We barely understand how ours works.

Also, modern LLMs built on the transformers architecture no longer use the neuron-inspired perceptron style topology for most of their compute.

I’ve heard that spiking NNs are supposed to mimic organic brains more closely, but I haven’t read into them much yet.



The attention mechanism is in practice implemented using three linear layers. The matrix multiplication to average the output and to implement the masking is the only non-neuronal part of that computation, but it can be seen as an activation function.
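For readers who haven't looked inside a transformer block, here is a minimal single-head attention sketch in NumPy illustrating the three linear projections and the masked softmax described above (an illustration only, not the implementation of any particular model):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def attention(x, Wq, Wk, Wv, causal=True):
        # x: (seq_len, d_model); Wq/Wk/Wv: the three linear layers
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        scores = q @ k.T / np.sqrt(k.shape[-1])      # scaled dot products
        if causal:                                   # mask out future tokens
            scores += np.triu(np.full(scores.shape, -1e9), k=1)
        return softmax(scores) @ v                   # weighted average of values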

Usually, linear perceptrons and ReLUs or GeLUs are used. Due to the enormous compute requirements to evaluate models of interesting size, other types of neuronal networks and activation functions have received very little attention (pun intended) so far.



Using ReLU instead of sigmoid is a significant departure with regards to how closely it models actual neurons.

Using non fully connected layers is as well. Our brains likely aren’t fully connected, but the connections that matter are made stronger through living life and learning.

If you squint, it’s kind of like training a dense series of linear layers, but that’s not what we’re doing anymore (for the better)

Comparing NNs to organic brains is an apples to oranges comparison, is what I’m saying.



Lack of adaption is mainly a feature, we choose not to train them in real-time and instead make available fixed models with repeatable behaviour. We could, if we wanted to, update the model weights continuously in response to feedback.

I think the biggest difference is that they need far more examples than we need, to learn anything.



> LLMs are more or less a dead end when it comes to AGI.

I don't think many people believe that LLMs are a way to AGI (whatever that actually means). But LLMs can still have many valid uses even if their prospects are limited in scope.



There are plenty of people - technical and non-technical - who seem to be acting like AGI is right around the corner thanks to LLMs, and who are, more broadly, vastly overstating the current capabilities of LLMs. I’m observing this in real life as much as on the internet. There are two very distinct groups of people that stand out to me: (1) High level execs with vested interests around AI and (2) Managers who haven’t even bothered to create an OpenAI account and are asking their subordinates to use ChatGPT for them, in what is an unforeseen usage of LLMs: by human proxy.


I recently read an interesting thread that laid out the case for LLMs being a path to AGI: https://old.reddit.com/r/singularity/comments/13ox85j/how_do...

The argument boils down to the idea that language isn't simply strings of words or bits of factual information, but an actual encoding of logic. By training statistical models on vast amounts of logic, we've given them a generalizable ability to perform logic. A sufficiently advanced LLM could thus potentially fulfill some definition of AGI.

To be clear, this doesn't in any way imply that LLMs could ever fit the definition of artificial consciousness, which would be a completely different form of strong AI. They're effectively just mathematical functions (albeit extremely complicated ones), which simply take inputs and return outputs without any intervening subjective experience. Even if they can perform a complicated task, retrieve and effectively summarize complicated information, or say all the right things as a conversational partner, they have no concept of the meaning of their output.

Maybe that limitation in itself puts a ceiling on their potential. Maybe the best possible LLM can only ever be 99.99% effective, and that 0.01% of the time it will go completely off the rails and disregard its instructions or hallucinate something ridiculous. Maybe the only way to overcome that is by keeping a human or a true artificial consciousness in the loop, in which case LLMs would still be extremely useful, but a flawed AGI if "AGI" at all. Or maybe a sufficiently advanced LLM and/or a sufficiently advanced error correction architecture will actually be enough to mitigate those issues.

I don't have a strong opinion on where LLMs are ultimately headed, but I'm looking forward to seeing how it all unfolds. It's amazing how capabilities that were strictly in the realm of sci-fi so quickly became mundane.



LLMs are definitely here to stay. Even if they don't turn out to be the road to AGI, they can be used by all sorts of sub-AGI agents as a "language centre". An encoder can be used to extract meaning from input, and an autoregressive decoder conditioned on the agent's internal state can be used to keep a conversation going. What's not clear at all is whether the traditional transformer architecture will endure.


I have yet to see an LLM that is cooperative. The magic of collaborating with someone is that we can both understand the problem and reason about it.

The current degree of LLM intelligence is not compelling for a social creature like me.



You could convince me with a ReAct agent in a shared environment.

Do you have any models that you find compelling? Maybe a domain model that you like or have wanted to try.

Don't get me wrong, I still use LLMs, but they just really need that extra augmentation for any non-trivial task.



Surprised to read that.

I use them as a cooperative partner by default.

Also: quite a few people have had instances work with other instances, sometimes of the same model and sometimes of other models.



Perhaps I'm up too late, but I can't think what else is there to cooperation besides two or more agents doing things in alignment with some goal? (Regardless of who or what sets that goal).

Also I don't know what you mean by "conceptualization".



It's fuzzy because intelligence is relative, right?

I mean "being able to conceive an idea". As humans, two or more of us can reason our way to a conclusion without domain knowledge. There is an upper limit where the idea is incomplete (assuming respectful ignorance), but it's generative nonetheless.

With an LLM I have to prompt engineer to guide it. I would rather have it generate novel concepts to push domain boundaries. They work great as knowledge bases though.



> As humans, two or more of us can reason our way to a conclusion without domain knowledge

That sounds like step-by-step thinking?

> With an LLM I have to prompt engineer to guide it.

I generally have to in humans, too. I mean, you and I are prompting each other, aren't we?

For me the difference between prompting a human and prompting an AI is that I can reset the AI, I can't make a human forget a previous analogy that had only confused them. (And likewise, I don't expect that I fully forget bad analogies which confuse me, though I do try).

> They work great as knowledge bases though.

IMO, that's their weakest part. We had knowledge bases before — where each claim can be easily localised within the model, corrected when it needs to be, verified in advance, and which give predictable output — LLMs are none of those things.

LLMs are much better at understanding the question (constant time for a fixed-length output, even when the query is phrased badly and relatively complex), and being able to synthesise things in the form of "${x} won't work, try ${y}".



Is it even allowed to ask questions??

Edit: my science fiction joke in the 90s was AI through bots chatting in IRC channels. They could seamlessly integrate human intelligence that way.



Have you ever talked to real average people?

I would say an LLM is more intelligent than at least some people I know. And in the domain of programming, most people I know. Simply by the fact that most people don't know programming.



LLMs are idiot savants that can do a few things very well and fail horribly at others. And they require careful prodding to correctly process tricky logical questions, exposing what they are at the core: text expanders and parroters. Highly useful of course to save typing effort and to aggregate insights over large context lengths. If anything, dealing with LLMs has helped me appreciate the capabilities of people more.


> exposing what they are at the core: text expanders and parroters.

They're much more than that. You can ask an LLM a question that it has never seen before, and it will give you a logical, reasonable answer. That requires knowledge of the world and the ability to reason.

LLMs aren't the same as humans, but neither are dogs or cats, and they're obviously intelligent in their own ways.



They will give that answer because they are forced to give it. The softmax turns whatever marginal outputs the model head produces into a probability distribution. This means that if they don't have an answer, they are quite likely to "hallucinate" it. This is of course influenced by the patterns they learned. And directing them to be more structured also utilizes patterns of structured thinking that are either part of finetuning or found somewhere in the training data.

The cat/dog vs. human analogy is a very bad comparison since their brains work fundamentally like human brains, while transformers are something completely different.



Programmers aren’t any better than someone who doesn’t know how to program.

Programming skill isn’t a measure of intelligence.

Go outside. Talk to real people. Touch some grass.



My impression from github copilot is that hallucinations are the result of certain true facts having a low likelihood and copilot giving you the most likely answer anyway.

Typically I have a certain library that does things in a very unorthodox and undocumented way, and when I ask copilot for an example it gives me wonderful, totally understandable code full of made-up functions that I wouldn't need in the first place if the library worked that way.

I don't think that running that query multiple times would help.



This is a very similar idea to ensemble models, which have been used for a long time in ML and proven to be very good. You average out the results of several predictors (or you let them vote and pick the most common prediction value), thereby reducing the noise in the prediction by choosing the common denominator of multiple predictions.
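As a concrete (non-LLM) illustration of that classic technique, here is a toy hard-voting ensemble in scikit-learn; the dataset and estimators are arbitrary, the point is only that several independently trained predictors vote and the majority wins:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, random_state=0)
    ensemble = VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            ("rf", RandomForestClassifier(random_state=0)),
            ("svm", SVC()),
        ],
        voting="hard",  # majority vote, analogous to picking the most common LLM answer
    )
    print(ensemble.fit(X, y).score(X, y))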


This is done in aerospace as well… however, even different teams clean-room writing to the same spec tend to make the same errors in their code, which ends up breaking the statistical model this approach was selected for.


But if I set the temperature to 0, the model will pick the most probable token and the output will always be the same. But we already know that this by no means guarantees a correct answer. So how can multiple runs be better?


Yes, but picking the most similar output from a bunch of queries with a higher temperature is not the same thing as the output from a single low temperature query.


Possibly, but it still doesn't explain why multiple runs will result in a better answer. In the work, the authors also haven't compared the multiple-run results with a single run using zero temperature. So, maybe all the overhead is just to achieve the same result already encoded in the network? I don't know.

Also the result is somewhat counterintuitive. We know that for a student with a low level of understanding, if we ask him a hard question and he tries many times, the most accurate answer is often not the most popular one but a single, one-off answer. And that is with retained memory, reasoning capacity and continuous learning, which is not the case with an LLM.

Btw: HN is for discussion. If some just want to vote for the beauty contest, please leave.



I found this other paper that tests Temperature: https://arxiv.org/abs/2402.05201

It appears that temperature has no impact on problem solving performance. So this paper isn't getting improved performance because the token for the correct answer is more probable.

My theory is that the multiple queries are allowing the whole probability space of possible answers to be sampled. Not just the probabilities of the most likely output token, but the probabilities of all possible internal model states.

And sampling that probability space of the whole model state and finding the average is a very different mathematical operation to just picking a single model state at random and then picking the most probable output tokens.
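A toy numerical example (made-up numbers) of why voting over repeated samples is a different operation from a single greedy decode: suppose the model can phrase answer A three different ways and answer B only one way.

    import random
    from collections import Counter

    # Hypothetical probabilities over complete outputs (phrasings).
    phrasings = {"A1": 0.25, "A2": 0.20, "A3": 0.15, "B1": 0.40}
    answer_of = {"A1": "A", "A2": "A", "A3": "A", "B1": "B"}

    # Greedy picks the single most probable phrasing -> answer "B".
    greedy = answer_of[max(phrasings, key=phrasings.get)]

    # Sampling and voting recovers the most probable *answer* -> "A" (p = 0.6).
    samples = random.choices(list(phrasings), weights=phrasings.values(), k=1000)
    voted = Counter(answer_of[s] for s in samples).most_common(1)[0][0]
    print(greedy, voted)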



If I'm reading this correctly, they had to discard Llama 2 answers and only use GPT-3.5 given answers to test the hypothesis.

GPT-3.5 answering questions through the OAI API alone is not an acceptable method of testing problem solving ability across a range of temperatures. OpenAI does some blackbox wizardry on their end.

There are many complex and clever sampling techniques for which temperature is just one (possibly dynamic) component

One example from the llama.cpp codebase is dynamic temperature sampling

https://github.com/ggerganov/llama.cpp/pull/4972/files

Not sure what you mean by whole model state given that there are tens of thousands of possible tokens and the models have billions of parameters in XX,XXX-dimensional space. How many queries across how many sampling methods might you need? Err..how much time? :)



> Also the result is somewhat counterintuitive. We know that for a student with a low level of understanding, if we ask him a hard question and he tries many times, the most accurate answer is often not the most popular one but a single, one-off answer.

This is a bad analogy.

Here’s what is actually happening with no “common sense but wrong” understanding of it:

- You have a set of probabilities per token.

- You randomize them.

This is not a “bad student being asked multiple times”; it is a system with randomized probabilities, creating a probability distribution.

If you want to see what a probability distribution looks like (eg. An electron cloud) then sampling only once is the wrong way to do it.

You basically have two distributions; the first one is the LLM, the second one is the shape generated by adding the random factor in the temperature.

This allows you to escape the “local maxima” encoded in the LLM distribution to find highly probable solutions that are outside the sample space of the “zero temperature”.

If you want a better analogy, look up at the night sky full of stars. Draw a circle in the sky; that’s the LLM distribution.

The result from a zero temperature will be the brightest point in that circle.

When you push the temperature up, you blur the sky randomly. Some points become brighter, some dimmer, but the radius of the circle increases.

If there is a very bright point outside the sample circle, 10x brighter than the brightest point inside it, then repeated random samples will repeatedly find it.

It makes perfect sense that an expanded probability distribution sampled repeatedly could find a “good average solution” if that solution is significantly better than the best “zero temp” solution.

This is the same reason we have 'temp' at all; by widening the solution space probability distribution, you can find better maxima. Turns out, sampling multiple times lets you have more chances to find better maxima.
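A minimal sketch of that temperature knob on a toy next-token distribution (illustrative numbers only): dividing the logits by the temperature before the softmax flattens the distribution, so "dimmer stars" outside the zero-temperature choice get sampled; at temperature 0 you degenerate to argmax.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample(logits, temperature):
        logits = np.asarray(logits, dtype=float)
        if temperature == 0:
            return int(np.argmax(logits))          # greedy / "zero temp"
        p = np.exp(logits / temperature)
        p /= p.sum()
        return int(rng.choice(len(logits), p=p))   # widened distribution

    logits = [2.0, 1.5, 0.2]
    print([sample(logits, 0.0) for _ in range(5)])  # always token 0
    print([sample(logits, 1.5) for _ in range(5)])  # spread across tokens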

This is more like "well that seems obviously like a good idea" than "somewhat counterintuitive"; it's just slow and expensive to do it.

You can also adjust the probability distribution by other existing methods, obviously, what's surprising here is not that it works, but that it seem to work so well; probably (and I note they did not try this in their paper), a multi-sample + voting on the output from other methods would also be highly effective.



Just from reading comments around, it feels intuitive to me that looking at a heatmap of a cascading pendulum would be more “accurate” than looking at just one snapshot, and also that joints on the pendulums don’t necessarily need to be interlinked between iterations of simulations.


> Which makes sense to me. If an LLM is giving you a mixture of "hallucinations" and correct answers, the correct answers will be similar and the hallucinations will hopefully be chaotic

I expect that to give you something close to the confidence of the underlying model in some specific claim, which is good, but I still expect legends (urban and cultural) to be high-ranked.

They'd be very human mistakes, but still mistakes.

I think the only way past that is to build a world model, look for contradictions, and then look for new evidence to resolve those contradictions.



It'd be interesting to plug this into a Bayesian-optimization-like framework: find the regions of language space where the models maximally disagree and then target those areas for extra training.


I had a very similar idea a few months ago. I wanted to use this approach to have the LLM provide the probability that the generated answer is correct. The probability would simply be what fraction of all generated answers was the one selected. (Each generated answer would be generated with a different seed and the question would be of single choice kind.) The two issues I found were 1) the cost, 2) on some problems, LLMs can be wrong more often than they are not.

Hopefully, as inference gets cheaper and of higher quality, someone will come up with a more feasible solution.



Could multiple agents be used such that tokens emitted from LLM A are passed to B and the output of B is passed to A, meaning two agents will be used to generate an output in a simple round-robin way? Both will share context in this case. My computer isn't big enough to run two large models, but this can be tried on tiny models perhaps.

I realize that for more than two and very specialised agents this will require some intelligent way to pass the output to specialist agents only. And also this means that there must be some overlap between the agents.



That is what’s already been done under the term "multi-agent". This paper argues that there’s no need for any such message-passing or context sharing, you just literally run the same query several times on the same model, fully independently, and then pick a "typical" reply according to some similarity metric.


I don't think this type of method can scale indefinitely; it's essentially just "better" sampling within dense areas of knowledge space. It cannot help with better exploration outside these dense areas, because these explorations won't have a consensus among agents almost by definition.


>Which makes sense to me. If an LLM is giving you a mixture of "hallucinations" and correct answers, the correct answers will be similar and the hallucinations will hopefully be chaotic

Not my experience. I had multiple LLMs hallucinate hard when asked the same question multiple times. The only way to break the cycle is to follow everything with questions demanding clarifications: "are you sure?", "this is wrong, correct the answer".



> I'm not sure people in these comments are reading this paper correctly.

> This seems to essentially disprove the whole idea of multi-agent setups like Chain-of-thought and LLM-Debate.

I'm not sure you have read the paper at all. Chain of thought prompting is not a multi-agent algorithm. The paper says that it enhances existing methods such as prompt engineering (chain of thought) and multi-agent debate. The sampling method presented in the paper is orthogonal to those methods.



Finally. I've been saying that we need to stop focusing on a single agent getting everything right and instead layer agents for about 16 months now, but it's great to have a paper to point to.

It's interesting that the diminishing returns for tasks flatten out rapidly around the same size as the ideal human meeting sizes: https://www.researchgate.net/figure/18-Optimal-Meeting-Sizes...

If this was done at more granular steps of agent quantity I'm curious just how closely it would match those numbers.

I'd also really love to see the eventual follow-up where we see how much more performance can be obtained when the agents are each fine tuned towards slightly different aims. I'd expect there'd even be a performance lift from just having the agents each set at different temperature levels.

Very happy to see the research community starting to step in this direction!



Great videos.

I have one personal niggle: I get annoyed when we end up lying to ourselves. Regarding the 101 section in video 1 - People forgot this the day LLMs came out. I felt this was too generous with the benefit of the doubt.

This basic point was and remains constantly argued - with “Emergence” and anthropomorphization being the heart of the opposing argument.



I think it's way more than 8 even. And it's common to have many working as supervisors, often at conflict with each other. And some act out the automatic trauma responses, as they're stuck in the past when the trauma occurred.


We have tons of specialized components that work together cooperatively and competitively. There’s multiple ways they connect. There also seems to be global processes that happen, like during sleep. There are over 3,000 cell types per the BRAIN initiative. Every brain forms on its own, taking shape like something out of a Transformers movie.

God’s design is mostly nothing like man’s neural networks. It’s far superior. Brains are also what’s creating all the artificial, neural nets on top of all math, tech, and economic systems that they run on. AI’s got a lot of catching up to do.



Maybe. I'm sure one's consciousness corresponds with one's guiding philosophy.

I don't think this supervisor model is generally applicable to people with EFD or some forms of Autism, for example.



Et voilà, you have the script of Inside Out. \s

But honestly I do think this is how we operate. Depending on our state of metabolism and other psychological factors, the dominant version changes but as a whole we remain the sum total of all these versions.



Kind of. More like a mixture of a mixture of experts.

The problem is MoE on its own isn't able to use the context as a scratch pad for differentiated CoT trees.

So you have a mixture of token suggestions, but a singular chain of thought.

A mixture of both is probably going to perform better than just a mixture of the former, especially given everything we know by now regarding in context learning or the degree of transmission synthetic data is carrying.



This seems related to an interesting recent ACM ByteCast podcast episode with Edward Chang, an Adjunct Professor in the Department of Computer Science at Stanford University. [1] (Note there is a transcript if you don't want to listen.)

The approach he uses is to arrange for multiple LLMs to dialogue between each other about a discussion topic where the human acts as a moderator instead of the question/answer format that LLMs commonly take today. They find that the final answer that multiple LLMs come to in dialogue results in a huge improvement in both precision and accuracy for the same resources.

[1]: https://learning.acm.org/bytecast/ep50-edward-y-chang



In optimization problems, randomness can often get you out of local minima/maxima, and so averaging out a bunch of random search paths might get you better results in the worst case. Something similar might be happening here. The training set will be biased in various ways that might create weird local min/max points and so this process could avoid those weird kinks.


The paper says that it enhances existing methods such as prompt engineering (chain of thought) and LLM debate. This agent method is orthogonal to LLM debate.


I built something like this in Haskell! I never benchmarked it, but I actually found it quite compelling. I would define each agent as a different "expert" in a subdomain of mathematics for example: proof theorist, abstract algebraic expert, etc.

I found it helpful, but the signal-to-noise ratio was low: lots of agents restating points, etc.



One frustration I've had with all this mixture-of-experts research:

Randomized Algorithms 101 - or basic stochastic reasoning - suggests that if the temperature parameter is > 0, querying an LLM N times and picking the majority result (perhaps with an N+1th query to the LLM) will generally result in better performance than asking it once and choosing that result.
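The textbook arithmetic behind that claim, as a quick sketch (assuming each sample is independently correct with probability p > 0.5, which real LLM errors of course only approximate):

    from math import comb

    def majority_correct(p, n):
        # Probability that more than half of n independent samples are correct.
        return sum(comb(n, k) * p**k * (1 - p)**(n - k)
                   for k in range(n // 2 + 1, n + 1))

    for n in (1, 5, 15):
        print(n, round(majority_correct(0.6, n), 3))
    # -> 0.6, ~0.683, ~0.787: the baseline any cleverer mixture should beat.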

It seems plausible to me that the gains can be further improved with a specialized mixture of different LLMs (which could then be run at temp = 0), or by finding better ways to break tasks into subtasks as this paper suggests. But AFAICT nobody has done anything to actually quantify these hypothetical gains versus the dumb randomized algorithm approach! In particular there might be voting strategies or mixtures - even specific models - where MoE/etc is strictly worse than naive repetition.

I am a concerned citizen w.r.t LLMs rather than a researcher, so I might be missing something. It just seems odd that LLM researchers forgot the first chapter of Motwani/Raghavan.



I'd assume that there's a difference between picking the best _token_ across an assortment of randomly selected tokens, versus picking the best _string_ of randomly-selected tokens.


Eyeballing the graphs, it seems that most of the gain is with 10 agents, a bit more with 20, and there are diminishing returns after that. Apparently, more agents isn't going to do it.


Is this not an incredibly expensive/unsustainable method? I agree with the sentiment that MoE is the way to go as the newer models will probably see diminishing returns. But the compute for a single prompt will suddenly increase 7-15 fold?


If GPT4 is 20x the price of GPT3.5, but it only takes 10x GPT3.5 runs to get similar quality of response (and likely faster), you'll still come out ahead.


I doubt that 10xGPT3.5 > GPT4. There are a lot of tasks that GPT4 can do and GPT3.5 just cannot. Also, in such cases I find that GPT3.5's hallucinations are quite consistent, so such a method is probably not gonna help.


Just reading the (current top) few comments and whimsically wondering at the super business model of companies offering LLM services: a car service that won't get you from point A to B unless you hail it n times. A detergent that must be applied n times before clothes come out ("probably") clean.

If a company is offering "Artificial intelligence" at a price, then isn't it reasonable that you only pay for correct answers? If the company is offering a car service, shouldn't you only pay if they take you to your destination?



Companies usually offer a service or a product. If the company doesn't deliver what was agreed upon, then the customer can demand correction. If a taxi driver takes a needlessly convoluted route, charges too much, or doesn't bring you to the destination, you can complain to the taxi company. If the laundry didn't work, you insist on doing it again.

However, many activities are inherently fraught with risk or uncertain results since there are always things outside of anyone's control. A lawyer can't promise you'll prevail in a court case, but they have to advocate your case to the best of their abilities. A doctor won't guarantee that you become healthy again. No taxi driver will guarantee you that you will reach the destination in time, but they will bring you there. Atlassian won't guarantee you will meet a release deadline if you use their managed JIRA instance, but they will do their best to prevent data loss. And a company that basically sells access to a chatbot won't guarantee that it gives you correct results. Maybe availability guarantees.



Agreed, and if it fails often enough, isn’t the bar at which a human or general-purpose, traditionally structured automation is going to be superior pretty low? This is how I think this bubble will pop. No doubt, LLMs are a breakthrough tool, but I’m sincerely skeptical of all but the most granular applications.

Perhaps the moral is that diffusing LLM agent accountability has the same failure model as the pre-existing human one.



I guess it's the difference between an ensemble and a mixture of experts, i.e. aggregating outputs from (a) model(s) trained on the same data vs different data (GPT-4). Though GPT-4 presumably does not aggregate, but it routes.


> GPT-4 is actually a pile of 3.5s

I understand the intention and the reference you're making. I bet the implementation of GPT-4 is probably something along those lines. However, spreading speculation in definitive language like that when the truth is unknown is dishonest, wouldn't you agree?



Sure, I could put it less definitively, but realistically, what else can it be? The transformer won't change much and all of the models, at the core, use it. It's a closely guarded secret because it's easy to replicate.


This is a cool paper showing there is value in using an LLM multiple times, but in recent research we showed that with majority voting, quality can decrease past some point as you make more calls. Check out https://arxiv.org/pdf/2403.02419.pdf. It raises the natural question of how to design the best inference algorithm given an LLM you can call multiple times.


This paper is specifically disproving the efficacy of agentic frameworks like AutoGen.

Also, the built-in function-calling in GPT4 is simpler to use than AutoGen2's abstraction.



So this is an ensemble of many LLMs?

I wonder how well a bunch of LLMs trained on personal computers, so fairly small, could perform together?

Train an LLM on your emails, train an LLM on a text book, download a bunch of arbitrary LLMs from the net you find interesting, throw them all together into a big pile, and use a moderator LLM that knows how to format their output into an assistant format.

So, the email LLM would try to autocomplete sentences from your emails, and the text book LLM would try to autocomplete sentences from the text book. People could offer LLMs to download, almost as a way of compressing information, download the LLM of your favorite programming language, and TV series, etc. The important part would be having a moderator algorithm that can shape these LLMs from dumb sentence autocompleters (barely more than a fancy Markov chain) into a coherent assistant format. For example, the text book LLM would just endlessly spew semi-random sentences from the text, but a good moderator algorithm could see that it has sufficiently answered the question and cut it off.

In short, it's interesting that separate LLMs can integrate with each other and strengthen each other and it makes me wonder if we could build modular LLMs.



Your idea inspired me to see what such a microstory based on your idea would look like (Of course generated by ChatGPT3.5):

> As I delved into my computer, eager to tackle my to-do list, I was met with an unexpected sight: a digital love triangle among the Language Models (LLMs). The Email LLM, with its quick wit, seemed to be engaging in flirtatious banter with the verbose Textbook LLM, while the Programming Language LLM watched on with amusement. I couldn't help but laugh at the absurdity of it all, but as the bickering between the LLMs intensified, I realized their antics were hindering my progress. With a mixture of frustration and amusement, I gently redirected the LLMs back to their intended purpose, finally able to accomplish my task amidst the chaotic comedy within my computer.



Model ensemble is a classic method. Deep learning is always rediscovering and reinventing the classics. I believe that many people have used this method before, but they just haven't published a paper on it.


I've saved the paper to read it later.

The premise of this work seems very interesting... But I wonder how practical it is from both a cost and time perspective. I am toying around with an AI Agents library and one of the annoying UX things I notice is the time it takes to get my answers, because each call to an agent (either GPT-4 or Claude 3) is kinda slow.

Besides the time, it feels quite wasteful token wise.

I'm skeptical this approach will be adopted by many in the AI Agent space, but of course I could be very wrong.



This is my go to method for pretty much every hard problem that I'm forced to solve where I don't have the domain expertise / interest / time. The trick lies in coming up with a clever similarity metric that incorporates penalties etc. You can even go a level deeper and use multiple similarity algorithms and then poll on top of them. Here's a taxonomy extractor for text that I made using similar principles that is surprisingly as good as anything else that I've seen - https://dash.scooptent.com/text
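For anyone curious what the "similarity metric" step can look like, here is a minimal sketch: pick the sampled answer that is most similar, on average, to all the others. difflib is just a stdlib stand-in; embeddings, BLEU, or a poll over several metrics (as suggested above) would slot into `sim` the same way.

    from difflib import SequenceMatcher

    def sim(a, b):
        return SequenceMatcher(None, a, b).ratio()

    def most_central(answers):
        # The answer with the highest total similarity to its peers wins.
        return max(answers, key=lambda a: sum(sim(a, b) for b in answers if b is not a))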


What's interesting is that each run of the model tends to converge to a different "local maximum" in the solution space, and some of these local maxima correspond to better performance than others. By running the model multiple times, we increase the chances of finding a higher-quality local maximum or even the absolute best solution.

This got me thinking: why is this ensembling step implemented as a higher-level abstraction on top of the base LLM, rather than being built directly into the neural network architecture and training process itself?



Well you’re right that LLM tooling is totally inadequate. At least we already have beam search. But the more boring answer (and why beam search is also uncommon) is that running the query multiple times is more expensive.


Reading the paper, the thought occurs that IF the likelihood of a correct response increases with the number of agents employed AND this involves application of a function (whatever) to select the 'best' from the possible answers, doesn't that imply that the LLM has insufficient dimensions?

In other words, I am wondering if LLM hallucinations [sic] are in fact symptomatic of 'conflation' which could itself be the result of insufficient dimensions.

Thoughts?



This is quite interesting because I've specifically tried this kind of basic ensembling for my NYT Connections benchmark and it didn't work. This is something everybody would try first before more complicated multi-step prompting, and yet since ChatGPT 3.5 I'm not aware of any papers showing that it works. It will be interesting to reproduce this result and learn more about how they set it up to make it work.


Here is a link to the main diagram: https://anonymous.4open.science/r/more_agent_is_all_you_need...

Seems like a pretty brute force approach of frankly just throwing more compute at the query (via semi-statistical means).

I'd be more interested in how to scale this via different agents, e.g. one type of agent that is specialized to produce ideas, while another is trained to evaluate ideas. Those sorts of chains seem like they'd be powerful - if you can find a way to generalize them.
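A hedged sketch of that generate-then-evaluate chain (`call_llm` is a hypothetical wrapper around whatever model you use, and the prompts are purely illustrative):

    def propose_and_critique(call_llm, task, n_ideas=5):
        ideas = [call_llm(f"Propose one approach to: {task}") for _ in range(n_ideas)]
        scores = [float(call_llm(f"Rate this approach to '{task}' from 0 to 10. "
                                 f"Reply with a number only:\n{idea}"))
                  for idea in ideas]
        return max(zip(scores, ideas))[1]  # return the best-rated idea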



I remember hearing that Beam Search doesn't work well for LLMs, because it leads to repetitive, generic output.

The majority vote sampling technique in this paper sounds like it'd give similar output to Beam Search, because it's sampling sequences of tokens from a joint distribution. So why doesn't it give repetitive output like Beam Search does? What am I missing?



Having given this problem a great deal of thought, I have developed a strong intuition around this. I believe not only is AGI feasible, it is already doable.

For example, several hundred GPT-4 based agents specializing in different skill sets should be able to collaboratively solve many problems. Their ability to work on so many facets of the same problem will make them very effective against multidisciplinary problems.

What’s the catch? Well, the back and forth has to play out in a serial order, so it cannot be parallelized. At today’s abysmal inference speeds, it may take this AGI many times longer than a trained human. Now imagine the effectiveness of this method when we can speed up inference to several hundred times a minute. Now AGI suddenly becomes way more efficient than a human.



this study's got me worried. it goes against the hope that AGI won't just sneak onto the internet and do its own thing. with more work on making LLMs bigger and showing they can do more, the thought that this could actually happen in the future gets real scary, especially since i can run smaller versions on my own pc now.


Does it take less compute to train N agents vs one large model? Seems like a big win. Can the majority of the training be done independently or in distributed fashion?


This trend of "All You Need" in paper titles needs to die. The original "Attention is All You Need" used that title because it is literally true in their case. So many papers just use it as a meme now, and it distracts from the true insight of the paper.


it's uncreative/tired but at least to the point. Too many papers are confused / opaque agglomerations of a year's worth of research shoehorned into a paper. At least with these you can fairly easily assess whether the claim is supported or not.


Obviously, more people are going to read it.

It is like putting a stupid face on your youtube video to show how shocked and amazed you are at the content.



I feel I used the word 'meme' appropriately

"A meme is an idea, behavior, or style that spreads by means of imitation from person to person within a culture and often carries symbolic meaning representing a particular phenomenon or theme."

Paper authors use "All You Need" to allude to the well-known transformer paper, even if their proposed technique is not in fact all you need.



You used the word in the sense it is commonly used, which is something like "often-repeated" or, more frequently, "shared a lot on social media."

While I agree that this is consistent with the WP definition in the broadest sense, it isn't really what Dawkins had in mind in the Selfish Gene.



My understanding is that a meme spreads through memetics similar to how a gene spreads through genetics. Ideas will spread that are fit for reproduction, i.e. communication.

Oftentimes memes are powerful at reproducing, but in science we don't want ideas that are likely to spread. We want ideas that are truthful



How about swarms of autonomous agents, such as AutoGPT, maybe thousands per human eventually, amassing karma points on all forums, including this one?

I can see in a few years each human being surrounded by a ton of LLM agents, "shepherding" their views, downvoting their messages or distracting them with argumentative conversations if they don't conform, and facilitating reputational attacks on scale on all the people whose speech is recognized as being contrary to what's desired.

Of course, there wouldn't be just one group deploying these swarms. It would be lots of different groups, akin to slaughterbots video: https://www.youtube.com/watch?v=O-2tpwW0kmU

The difference is that there wouldn't be physical violence, it would just gradually turn the entire Internet into a dark forest.



Well then most will be demanding that our governmental institutions issue cryptographically signed ID cards verified by an in-person visit to the DMV.

Or, you choose to opt out and swim in a sea of nonsense.



Averaging LLM outputs will ensure the final output will contain a lot of words with no substance. However, it’s essential to recognize that averaging bad data doesn’t always lead to better results. Garbage in, garbage out — averaging cannot magically transform flawed inputs into accurate outputs.