>> Because this paper introduces their alternative method which simply runs the same query multiple times on the same LLM, without any context shared across queries. And then they run a similarity algorithm on the answers and pick the most common answer.

https://en.wikipedia.org/wiki/Lorenz_system

Years ago, weather simulations started tweaking input parameters and running their models over and over, discarding outliers and taking averages. It works pretty well. Because LLMs mostly have random seeds (aka temperature), feeding them the same input and averaging the outputs is going to get you a better guess. Lorenz also gives some clues (if not an outright explanation) as to why the "hallucination" problem is likely unsolvable. If you buy into this line of thinking, it quickly becomes apparent that LLMs are more or less a dead end when it comes to AGI. Simulating isn't emulating... an LLM is as likely to become intelligent as a forecast is to control the weather.
> If I study Einstein and learn to do a really good impression, the statement "Einstein often sounds like karmacondon" will be true.

Wrong alt, hooande ;)
Right. This is Searle's "a simulated plane won't get you to Japan" argument. That's true. But a simulated calculator is perfectly effective for doing your taxes.
I recently read an interesting thread that laid out the case for LLMs being a path to AGI: https://old.reddit.com/r/singularity/comments/13ox85j/how_do...

The argument boils down to the idea that language isn't simply strings of words or bits of factual information, but an actual encoding of logic. By training statistical models on vast amounts of logic, we've given them a generalizable ability to perform logic. A sufficiently advanced LLM could thus potentially fulfill some definition of AGI.

To be clear, this doesn't in any way imply that LLMs could ever fit the definition of artificial consciousness, which would be a completely different form of strong AI. They're effectively just mathematical functions (albeit extremely complicated ones), which simply take inputs and return outputs without any intervening subjective experience. Even if they can perform a complicated task, retrieve and effectively summarize complicated information, or say all the right things as a conversational partner, they have no concept of the meaning of their output.

Maybe that limitation in itself puts a ceiling on their potential. Maybe the best possible LLM can only ever be 99.99% effective, and that 0.01% of the time it will go completely off the rails and disregard its instructions or hallucinate something ridiculous. Maybe the only way to overcome that is by keeping a human or a true artificial consciousness in the loop, in which case LLMs would still be extremely useful, but a flawed AGI, if "AGI" at all. Or maybe a sufficiently advanced LLM and/or a sufficiently advanced error correction architecture will actually be enough to mitigate those issues.

I don't have a strong opinion on where LLMs are ultimately headed, but I'm looking forward to seeing how it all unfolds. It's amazing how capabilities that were strictly in the realm of sci-fi so quickly became mundane.
Surprised to read that. I use them as a cooperative partner by default. Also: quite a few people have had instances work with other instances, sometimes of the same model and sometimes of other models.
Is it even allowed to ask questions?? Edit: my science fiction joke in the 90s was AI through bots chatting in IRC channels. They could seamlessly integrate human intelligence that way.
Programmers aren’t any better than someone who doesn’t know how to program. Programming skill isn’t a measure of intelligence. Go outside. Talk to real people. Touch some grass.
Yes, but picking the most similar output from a bunch of queries with a higher temperature is not the same thing as the output from a single low temperature query.
I found this other paper that tests temperature: https://arxiv.org/abs/2402.05201 It appears that temperature has no impact on problem-solving performance, so this paper isn't getting improved performance because the token for the correct answer is more probable. My theory is that the multiple queries are allowing the whole probability space of possible answers to be sampled: not just the probabilities of the most likely output token, but the probabilities of all possible internal model states. And sampling that probability space of the whole model state and finding the average is a very different mathematical operation from just picking a single model state at random and then picking the most probable output tokens.
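Under the simplifying assumption that each query is an independent draw from a fixed answer distribution, a toy simulation shows why resampling helps: even if the correct answer only has a 40% chance on any single draw, a plurality vote over 20 draws recovers it far more often, because the wrong probability mass is spread across many different strings. A rough sketch (all numbers invented purely for illustration):

```python
import random
from collections import Counter

random.seed(0)

# Toy answer distribution for one query: the correct answer is only a
# plurality (40%); the rest of the mass is spread over distinct wrong answers.
answers = ["42", "41", "43", "44", "45"]
probs   = [0.40, 0.15, 0.15, 0.15, 0.15]

def one_sample():
    return random.choices(answers, probs)[0]

def majority_vote(n):
    votes = Counter(one_sample() for _ in range(n))
    return votes.most_common(1)[0][0]

trials = 2000
single_acc = sum(one_sample() == "42" for _ in range(trials)) / trials
voted_acc  = sum(majority_vote(20) == "42" for _ in range(trials)) / trials
print(f"single-sample accuracy:  {single_acc:.2f}")  # close to 0.40
print(f"20-sample majority vote: {voted_acc:.2f}")   # substantially higher
```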
If I'm reading this correctly, they had to discard the Llama 2 answers and only use answers given by GPT-3.5 to test the hypothesis. GPT-3.5 answering questions through the OpenAI API alone is not an acceptable method of testing problem-solving ability across a range of temperatures; OpenAI does some black-box wizardry on their end. There are many complex and clever sampling techniques for which temperature is just one (possibly dynamic) component. One example from the llama.cpp codebase is dynamic temperature sampling: https://github.com/ggerganov/llama.cpp/pull/4972/files Not sure what you mean by "whole model state" given that there are tens of thousands of possible tokens and the models have billions of parameters in XX,XXX-dimensional space. How many queries across how many sampling methods might you need? Err... how much time? :)
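For reference, the dynamic temperature idea is roughly: scale the temperature by how uncertain the model already is, so peaked next-token distributions get sampled nearly greedily and flat ones get more exploration. A minimal sketch of that entropy-based scaling (the actual implementation in the linked PR differs in details such as parameter names and smoothing):

```python
import math

def dynamic_temperature(logits, min_temp=0.0, max_temp=1.5, exponent=1.0):
    """Pick a sampling temperature from the normalized entropy of the
    next-token distribution: low entropy -> low temperature."""
    # softmax at temperature 1 to measure the model's own uncertainty
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]

    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    max_entropy = math.log(len(probs))                  # entropy of a uniform distribution
    norm_entropy = entropy / max_entropy if max_entropy > 0 else 0.0

    return min_temp + (max_temp - min_temp) * norm_entropy ** exponent

# A peaked distribution gets a temperature near min_temp,
# a flat one gets exactly max_temp.
print(dynamic_temperature([10.0, 0.0, 0.0, 0.0]))  # close to 0.0
print(dynamic_temperature([1.0, 1.0, 1.0, 1.0]))   # 1.5
```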
It'd be interesting to plug this into a Bayesian-optimization-like framework: find the regions of language space where the models maximally disagree and then target those areas for extra training.
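A rough sketch of what that acquisition step could look like, assuming you already have callable models and a pool of candidate prompts (the disagreement score and the `models` callables are placeholders I'm inventing here, not anything from the paper):

```python
from collections import Counter

def disagreement(answers):
    """1 - (fraction of samples agreeing with the modal answer).
    0.0 means every sample agrees; values near 1.0 mean maximal disagreement."""
    counts = Counter(answers)
    return 1.0 - counts.most_common(1)[0][1] / len(answers)

def select_for_training(prompts, models, samples_per_model=5, budget=100):
    """Score each candidate prompt by cross-model disagreement and return
    the top `budget` prompts as extra-training targets."""
    scored = []
    for prompt in prompts:
        answers = [
            model(prompt)          # each model is a callable: prompt -> answer string
            for model in models
            for _ in range(samples_per_model)
        ]
        scored.append((disagreement(answers), prompt))
    scored.sort(reverse=True)
    return [prompt for _, prompt in scored[:budget]]
```

A proper Bayesian-optimization loop would additionally fit a surrogate over an embedding of the prompts so it can propose new regions rather than just rank a fixed pool.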
Finally. I've been saying for about 16 months now that we need to stop focusing on a single agent getting everything right and instead layer agents, but it's great to have a paper to point to. It's interesting that the diminishing returns for tasks flatten out rapidly around the same size as the ideal human meeting sizes: https://www.researchgate.net/figure/18-Optimal-Meeting-Sizes... If this were done at more granular steps of agent quantity, I'm curious just how closely it would match those numbers. I'd also really love to see the eventual follow-up where we see how much more performance can be obtained when the agents are each fine-tuned towards slightly different aims. I'd expect there'd even be a performance lift from just having the agents each set at different temperature levels. Very happy to see the research community starting to step in this direction!
Maybe. I'm sure one's consciousness corresponds with one's guiding philosophy. I don't think this supervisor model is generally applicable to people with EFD or some forms of Autism, for example.
The paper says that it enhances existing methods such as prompt engineering (chain of thought) and LLM debate. This agent method is orthogonal to LLM debate.
I'd assume that there's a difference between picking the best _token_ across an assortment of randomly selected tokens, versus picking the best _string_ of randomly-selected tokens.
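There is: greedy token-by-token selection and picking the most probable whole string can disagree. A tiny made-up two-step example (numbers invented purely for illustration):

```python
# Toy two-token language model. Step 1 picks "a" or "b";
# step 2's distribution depends on step 1's choice.
step1 = {"a": 0.6, "b": 0.4}
step2 = {
    "a": {"c": 0.5, "d": 0.5},   # mass after "a" is split
    "b": {"c": 0.9, "d": 0.1},   # mass after "b" is concentrated
}

# Probability of every complete string.
strings = {
    t1 + t2: p1 * p2
    for t1, p1 in step1.items()
    for t2, p2 in step2[t1].items()
}
print(strings)
# {'ac': 0.3, 'ad': 0.3, 'bc': 0.36, 'bd': 0.04}

# Greedy decoding picks the best token at each step: "a" (0.6), then "c" or "d",
# ending on a string with probability 0.30. The most probable *string* is "bc"
# at 0.36, which is also what the mode of many independent samples converges to.
best_string = max(strings, key=strings.get)
print(best_string)  # 'bc'
```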
Eyeballing the graphs, it seems that most of the gain is with 10 agents, a bit more with 20, and there are diminishing returns after that. Apparently, more agents isn't going to do it.
If GPT-4 is 20x the price of GPT-3.5, but it only takes 10 GPT-3.5 runs to get a similar quality of response (and likely faster), you'll still come out ahead.
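Worked out with those assumed prices: at a per-call cost of c for GPT-3.5 and 20c for GPT-4, a 10-sample vote costs 10c per question versus 20c for a single GPT-4 call, i.e. half the price; the break-even point would only be reached at 20 samples.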
This is a cool paper showing there is value in using an LLM multiple times, but in recent research we showed that with majority voting, quality can decrease past some point as you make more calls. Check out https://arxiv.org/pdf/2403.02419.pdf. It raises the natural question of how to design the best inference algorithm given an LLM you can call multiple times.
This paper is specifically disproving the efficacy of agentic frameworks like AutoGen. Also, the built-in function calling in GPT-4 is simpler to use than AutoGen2's abstraction.
This is my go-to method for pretty much every hard problem that I'm forced to solve where I don't have the domain expertise / interest / time. The trick lies in coming up with a clever similarity metric that incorporates penalties etc. You can even go a level deeper and use multiple similarity algorithms and then poll on top of them. Here's a taxonomy extractor for text that I made using similar principles that is surprisingly as good as anything else that I've seen: https://dash.scooptent.com/text
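A minimal sketch of that "poll across several similarity metrics" idea, assuming the answers are plain strings (the two metrics here are arbitrary examples, not whatever the linked tool uses):

```python
from difflib import SequenceMatcher

def char_similarity(a, b):
    return SequenceMatcher(None, a, b).ratio()

def token_jaccard(a, b):
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 1.0

METRICS = [char_similarity, token_jaccard]

def pick_consensus(answers):
    """Each metric scores every answer by its total similarity to the other
    answers and then 'votes' for its favourite; the answer with the most
    metric votes wins."""
    votes = {}
    for metric in METRICS:
        totals = [
            (sum(metric(a, b) for j, b in enumerate(answers) if j != i), a)
            for i, a in enumerate(answers)
        ]
        best = max(totals)[1]
        votes[best] = votes.get(best, 0) + 1
    return max(votes, key=votes.get)

# The near-duplicate answers reinforce each other, so one of them is returned.
print(pick_consensus(["Paris", "paris", "Lyon", "Paris is the capital"]))
```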
Here is a link to the main diagram: https://anonymous.4open.science/r/more_agent_is_all_you_need... Seems like a pretty brute-force approach of frankly just throwing more compute at the query (via semi-statistical means). I'd be more interested in how to scale this via different agents, e.g. one type of agent that is specialized to produce ideas while another is trained to evaluate ideas. Those sorts of chains seem like they'd be powerful, if you can find a way to generalize them.
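A minimal sketch of that propose-then-evaluate chain, assuming only a generic `llm(prompt, temperature)` callable (everything here, including the prompts, is a made-up illustration rather than anything from the paper):

```python
def propose_and_select(question, llm, n_ideas=5):
    """One 'creative' pass generates candidate ideas at high temperature,
    then an 'evaluator' pass scores each candidate at low temperature."""
    ideas = [
        llm(f"Propose one solution to: {question}", temperature=1.0)
        for _ in range(n_ideas)
    ]

    scored = []
    for idea in ideas:
        reply = llm(
            f"Question: {question}\nProposed solution: {idea}\n"
            "Rate this solution from 1 to 10. Reply with only the number.",
            temperature=0.0,
        )
        try:
            score = float(reply.strip())
        except ValueError:
            score = 0.0   # an unparseable rating counts as a failure
        scored.append((score, idea))

    return max(scored)[1]   # highest-rated idea
```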
Does it take less compute to train N agents vs one large model? Seems like a big win. Can the majority of the training be done independently or in a distributed fashion?
Obviously, more people are going to read it. It is like putting a stupid face on your YouTube video to show how shocked and amazed you are at the content.
How about swarms of autonomous agents, such as AutoGPT, maybe thousands per human eventually, amassing karma points on all forums, including this one? I can see, in a few years, each human being surrounded by a ton of LLM agents "shepherding" their views, downvoting their messages or distracting them with argumentative conversations if they don't conform, and facilitating reputational attacks at scale on all the people whose speech is recognized as being contrary to what's desired. Of course, there wouldn't be just one group deploying these swarms. It would be lots of different groups, akin to the Slaughterbots video: https://www.youtube.com/watch?v=O-2tpwW0kmU The difference is that there wouldn't be physical violence; it would just gradually turn the entire Internet into a dark forest.
This seems to essentially disprove the whole idea of multi-agent setups like Chain-of-thought and LLM-Debate.
Because this paper introduces their alternative method, which simply runs the same query multiple times on the same LLM, without any context shared across queries. And then they run a similarity algorithm on the answers and pick the most common answer. (Which makes sense to me. If an LLM is giving you a mixture of "hallucinations" and correct answers, the correct answers will be similar and the hallucinations will hopefully be chaotic.)
And this simple algorithm performs just as well as (and sometimes better than) all the other multi-agent algorithms.
This suggests that the other multi-agent schemes with their clever prompts aren't really doing anything special; their improved results come mostly from the fact that the LLM is run multiple times, not from the prompts asking the LLM to pick the best answer.
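For anyone who wants the shape of it in code, here is a minimal sketch of that sampling-and-voting loop as described above; `query_llm` is a placeholder for whatever API you call, and pairwise string similarity stands in for whatever similarity measure the paper actually uses:

```python
from difflib import SequenceMatcher

def query_llm(prompt: str) -> str:
    """Placeholder: send `prompt` to the model, fresh context each call."""
    raise NotImplementedError

def sample_and_vote(prompt: str, n_samples: int = 10) -> str:
    # 1. Run the same query n times with no shared context.
    answers = [query_llm(prompt) for _ in range(n_samples)]

    # 2. Score each answer by its cumulative similarity to all the others;
    #    the answer that most of the samples "agree with" scores highest.
    def total_similarity(i: int) -> float:
        return sum(
            SequenceMatcher(None, answers[i], answers[j]).ratio()
            for j in range(n_samples)
            if j != i
        )

    # 3. Return the most representative (i.e. most common-looking) answer.
    best = max(range(n_samples), key=total_similarity)
    return answers[best]
```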