No, the reason for the disappointment was that early AI pioneers considered chess a model of human intelligence, and they expected a chess-playing AI to help them understand how human intelligence works. To have computer chess devolve into a race to beat human champions using techniques that only computers can use clearly defeated this purpose.
Those "early pioneers" were people like Alan Turing, Claude Shannon, Marvin Minsky, Donald Michie and John McCarthy, all of whom were chess players themselves and were prone to thinking of computer chess as a window into the inner workings of the human mind. Here's what McCarthy had to say when Deep Blue beat Kasparov: In 1965 the Russian mathematician Alexander Kronrod said, "Chess is the Drosophila of artificial intelligence." However, computer chess has developed much as genetics might have if the geneticists had concentrated their efforts starting in 1910 on breeding racing Drosophila. We would have some science, but mainly we would have very fast fruit flies. Three features of human chess play are required by computer programs when they face harder problems than chess. Two of them were used by early chess programs but were abandoned in substituting computer power for thought. http://www-formal.stanford.edu/jmc/newborn/newborn.html Then he goes on to discuss those three features of human chess play. It doesn't really matter which they are but it's clear that he is not complaining about anyone "playing wrong", he's complaining about computer chess taking a direction that fails to contribute to a scientific understanding of human, and I would also say machine, intelligence. |
Yeah, very good points. To be fair, there are people who have argued the big-data side who clearly have solid knowledge of AI and are not just SV suits; for example, I remember Yann LeCun in a debate with Christopher Manning, where Manning was arguing for the importance of "structure" and LeCun was arguing against it. Or see the "Bitter Lesson", mentioned in a parent comment. That may have become a total shibboleth of the Silicon Valley bros, but Rich Sutton, who wrote the eponymous article, is the guy who wrote the book on Reinforcement Learning (literally). And then Rodney Brooks replied with his "Better Lesson" (https://rodneybrooks.com/a-better-lesson/). So there's a lot of debate on this, and I don't reckon we'll have a consensus soon.

It should be clear which side I'm on: I work with firmly model-based AI ("planning is the model-based approach to autonomous behaviour" has become my shibboleth; see Bonet and Geffner's book on planning: https://link.springer.com/book/10.1007/978-3-031-01564-9), so maybe it's déformation professionnelle. And even LeCun's recent plans for JEPA are very consciously model-based, except he wants to learn his models from data, which is not a bad idea I suppose.
> You're saying the publicly available problem set isn't indicative of the distribution of the test set?
Yes. From https://arcprize.org/guide:
Do you have any perspectives to share on Ryan's observation of a potential scaling law for these tasks and his comment that "ARC-AGI will be one benchmark among many that just gets solved by scale"?
> I studied algorithms for years.
Who hasn't?

> You're 100% WRONG on everything you wrote.

Maybe you should update the Wikipedia page, and then all the other textbooks that use a definition of brute force that matches my understanding of it. From https://en.wikipedia.org/wiki/Brute-force_search:

> Therefore, brute-force search is typically used when the problem size is limited, or when there are problem-specific heuristics that can be used to reduce the set of candidate solutions to a manageable size.

Further, on the same page (https://en.wikipedia.org/wiki/Brute-force_search#Speeding_up...):

> One way to speed up a brute-force algorithm is to reduce the search space, that is, the set of candidate solutions, by using heuristics specific to the problem class.

I mean, the approach under discussion is literally exactly this. Now, Mr "ACM ICPC, studied algorithms for years", where's your reference that reducing the solution space using heuristics results in a non-brute-force algorithm?
> Reference? Link, even?
Sure, here's the definition of "brute force" from university textbook material written by pllk, who has taught algorithms for 20 years and holds a 2400 rating on Codeforces: https://tira.mooc.fi/kevat-2024/osa9/

"Yleispätevä tapa ratkaista hakuongelmia on toteuttaa raakaan voimaan (brute force) perustuva haku, joka käy läpi kaikki ratkaisut yksi kerrallaan." (Roughly: "A general way to solve search problems is to implement a search based on brute force, which goes through all the solutions one by one.")

edit: Here's an English-language book written by the same author, though the English source does not precisely define the term. In chapter 5:

> Complete search is a general method that can be used to solve almost any algorithm problem. The idea is to generate all possible solutions to the problem using brute force ...

And a bit further down chapter 5:

> We can often optimize backtracking by pruning the search tree. The idea is to add "intelligence" to the algorithm so that it will notice as soon as possible if a partial solution cannot be extended to a complete solution. Such optimizations can have a tremendous effect on the efficiency of the search.

Your mistake is that you, for some reason, believe that any search over the solution space is a brute-force solution. But there are many ways to search over a solution space. A "dumb search" over the solution space is generally considered brute force, whereas a "smart search" is generally not.

Here's the Codeforces profile of the author: https://codeforces.com/profile/pllk

edit 2: Ok, now I think I understand what causes your confusion. When an author writes "One way to speed up a brute-force algorithm ...", you think that the algorithm can still be called "brute force" after whatever optimizations were applied. No, that's not what that text means. This is like saying "One way to make a gray car more colorful is by painting it red". Is it still a gray car after it has been painted red? No, it is not.
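To make the distinction concrete, here is a minimal Python sketch (my own illustration, not from either textbook), using n-queens as a stand-in problem: the first function enumerates every candidate and checks it after the fact, which is what "brute force" usually means; the second prunes partial placements, the "smart search" the book describes.

```python
from itertools import permutations

def queens_brute_force(n):
    """Plain brute force: enumerate every column permutation, then check it."""
    count = 0
    for perm in permutations(range(n)):
        if all(abs(perm[i] - perm[j]) != j - i
               for i in range(n) for j in range(i + 1, n)):
            count += 1
    return count

def queens_backtracking(n):
    """Pruned backtracking: abandon a partial placement as soon as it conflicts."""
    def extend(cols):
        row = len(cols)
        if row == n:
            return 1
        total = 0
        for col in range(n):
            if all(c != col and abs(c - col) != row - r for r, c in enumerate(cols)):
                total += extend(cols + [col])
        return total
    return extend([])

# Both count the 92 solutions for n=8, but they visit very different numbers of nodes.
assert queens_brute_force(8) == queens_backtracking(8) == 92
```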
It's not that novel. Others have implemented this approach, in the context of mathematics.
Already the 2021 Drori paper (and many papers since) did similar things. It's a common idea in this space...
Do you have any suggestions for a better approach to testing artificial intelligence? I mean, in a way that allows comparing different approaches and is a reasonable metric of progress.
That kind of neuro-symbolic AI is a bit like British cuisine: place two different things next to each other on the same plate, like bangers and mash, and call it "a dish".
Nope. This is neurosymbolic AI: Abductive Knowledge Induction From Raw Data

https://www.doc.ic.ac.uk/~shm/Papers/abdmetarawIJCAI.pdf

That's a symbolic learning engine trained in tandem with a neural net. The symbolic engine is learning to label examples for the neural net that learns to label examples for the symbolic engine. I call that cooking!

(Full disclosure: the authors of the paper are my thesis advisor and a dear colleague.)
> GPT-4 is an AGI, just a very bad one.
Then stop selling it as a tool to replace humans. A fast-moving car breaking through a barrier and flying off a cliff could be called "an airborne means of transportation, just a very bad one", yet nobody is suggesting it should replace school buses if only we could add longer wings to it. What the LLM community refuses to see is that there is a limit to the patience and the financing the rest of the world will grant you before you're told, "it doesn't work, mate."

> So at what point does a human go from not generally intelligent to generally intelligent?

Developmental psychology would be a good place to start looking for answers to this question. Also, forgetting the scientific approach and going with common sense: we do not allow young humans to operate complex machinery, decide who is allowed to become a doctor, or go to jail. Intelligence is not equally distributed across the human population and some of us never have much of it, yet we function and have a role in society. Our behaviour, choices, preferences and opinions are not just based on our intelligence, but often on our past experiences and circumstances. It is also not the sole quality we use to compare ourselves against each other. A not very intelligent person is capable of making the right choices (an otherwise obedient soldier refusing to press the button and blow up a building full of children); similarly, a highly intelligent person can become a hard-to-find serial criminal (a gynecologist impregnating his patients).

What intelligent and creative people hold against LLMs is not that they replace them, but that they replace them with a shit version of them, relegating thousands of years of human progress and creativity to the dustbin of the models and layers of tweaks to the output that still produce unreliable crap. I think the person who wrote this sign summed it up best: https://x.com/gvanrossum/status/1802378022361911711
| The "general" part of AGI implies it should be capable across all types of different tasks. I would definitely call it real Artificial Intelligence, but it's not general by any means. |
Hmm, but is it really "generalizing" or just pulling information from the training data? I think that's what this benchmark is really about: adapting quickly to something it has never seen before.
I won't be surprised if GPT-5 is able to do it: it knows that it's an LLM, so it knows its limitations. It can write code to pre-process the input into a format which is better understood, etc.
https://chatgpt.com/share/2fde1db5-00cf-404d-9ae5-192aa5ac90...

GPT-4 created a plan very similar to the article, i.e. it also suggested using Python to pre-process data. It also suggested using program synthesis. So I'd say it's already 90% there.

> "Execute the synthesized program on the test inputs."

> "Verify the outputs against the expected results. If the results are incorrect, iteratively refine the hypotheses and rules."

So people saying that it's ad hoc are wrong. LLMs know how to solve these tasks; they are just not very good at coding, and iterative refinement tooling is in its infancy.
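For what it's worth, here is a minimal sketch of what that synthesize/execute/verify/refine loop could look like (my own illustration; `ask_llm_for_program` is a hypothetical callback wrapping the LLM call, not a real API):

```python
def solve_task(train_pairs, test_input, ask_llm_for_program, max_rounds=5):
    """Iteratively ask for a candidate program, verify it on the training pairs,
    and feed failures back until one works or we run out of rounds."""
    feedback = ""
    for _ in range(max_rounds):
        source = ask_llm_for_program(train_pairs, feedback)  # returns Python source
        namespace = {}
        try:
            exec(source, namespace)                # candidate defines transform(grid)
            transform = namespace["transform"]
            failures = [(x, y, transform(x)) for x, y in train_pairs
                        if transform(x) != y]
        except Exception as e:
            feedback = f"Program crashed: {e!r}"
            continue
        if not failures:                           # verified on all training examples
            return transform(test_input)
        feedback = f"Wrong on {len(failures)} training example(s), e.g. {failures[0]}"
    return None
```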
It's LLM grade school. Let them cook, train these things to match utility in our world. I'm not married to the "AGI" goal if there is other utility along the way.
Seems that ARC-AGI is more flawed, rather than GPT-4o being more AGI.
Maybe an AI version of Hanlon's Razor: never attribute to AGI what can easily be explained by its being in the training set.
> 50% accuracy on the public test set for ARC-AGI by having GPT-4o
Isn't the public test set public on GitHub, and therefore something GPT-4o was trained on?
Public discussions of solutions to the public test set will presumably have somewhat similar analogies and/or embeddings to aspects of the Python programs that solve them.
Einstein, infamously, couldn't really make much progress with quantum physics, even though he invented the precursors (e.g., Brownian motion). Your world model is hard to update.
A bit of a stretch, given that Chollet is a researcher in deep learning and transformers and his criticism is that memorization (training LLMs on lots and lots of problems) doesn't equate to AGI.
The stretch was in reference to comparing Chollet to Einstein. Chollet clearly understands LLMs (and transformers and deep learning); he simply doesn't believe they are sufficient for AGI.
You're reading a whole lot into a tweet. In his interview with Dwarkesh Patel he says, about 20 different times, that scaling LLMs (as they are currently conceived) won't lead to AGI.
You keep changing topics, so I don't get it either. I can attest it's not a fringe view that the situation is interesting; I've seen it discussed several times today by unrelated people.
But it was his EPR paper that inspired Bell's inequality and pushed the field further. Yes, he was wrong about how reality works, but he still asked the right question.
> a very large percentage of people living today are also not thinking/understanding/sentient
This isn't that big a bullet to bite (https://www.lesswrong.com/posts/4AHXDwcGab5PhKhHT/humans-who... comes from well before ChatGPT's launch), and I myself am inclined to bite it. System 1 alone does not a general intelligence make, although the article is extremely interesting in asking the question "is System 1 plus Python enough for a general intelligence?".

But it's not a very relevant philosophical point, because Chollet's position is consistent with humans being obsoleted and/or driven extinct whether or not the LLMs are "general intelligences". His position is that training LLMs results in an ever-larger number of learned algorithms and no ability to construct new algorithms. This is consistent with the possibility that, after some threshold of size and training, the LLM has learned every algorithm it needs to supplant humans in (say) 99.9% of cases. (It would definitely be going out with a whimper rather than a bang, on that hypothesis, to be out-competed by something that _really is_ just a gigantic lookup table!)
Yeah, and GPT-4o was potentially trained on this test set, and even if they tried to hold it out, it was still likely trained on discussions of the problems.
François Chollet says LLMs do not learn in-context. But Geoff Hinton says LLMs' few-shot learning compares quite favorably with people!
https://www.youtube.com/watch?v=QWWgr2rN45o&t=46m20s

The truth is in the middle, I think. They learn in-context, but not as well as humans. The approach in the article hides the unreliability of current LLMs by generating thousands of programs, and still the results aren't human-level. (This is impressive work though -- I'm not criticizing it.)
I'm not sure how to quantify how quickly or well humans learn in-context (if you know of any work on this, I'd love to read it!).
In general, there is too much fluff and confusion floating around about what these models are and are not capable of (regardless of the training mechanism). I think more people need to read Song Mei's lovely slides [1] and related work by others. These slides are the best exposition I've found of neat ideas around ICL that researchers have been aware of for a while.

[1] https://www.stat.berkeley.edu/~songmei/Presentation/Algorith...
Amazing work, prompt engineering at its finest. One future direction for ARC-AGI could be to use not Python, but a much more concise programming language that is better suited to brute-force methods like genetic mutations. The problem, of course, would be to train an LLM that is proficient enough in such a language. I am thinking about stack-based languages. For this competition I would develop a careful bit-level encoding of a variant of the 'Joy' programming language (https://en.wikipedia.org/wiki/Joy_(programming_language)). It would be a considerable effort, though, which I don't have time for, hence I post this idea publicly. A promising direction is a mix of things, in my opinion: a special concise stack-based language, consulting LLMs like the OP did, and genetic algorithms combined.
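To give a flavour of the idea, here is a tiny sketch (my own, not the commenter's design, and nowhere near a bit-level encoding) of a Joy-like concatenative language over grids, with the kind of point mutation a genetic search might apply. The op names and semantics are invented for illustration.

```python
import random

# Unary grid transforms; each pops a grid off the stack and pushes the result.
OPS = {
    "flip_h": lambda g: [row[::-1] for row in g],
    "flip_v": lambda g: g[::-1],
    "rot90":  lambda g: [list(r) for r in zip(*g[::-1])],  # rotate clockwise
}

def run(program, grid):
    """Concatenative evaluation, Joy-style: tokens are applied left to right."""
    stack = [grid]
    for op in program:
        if op == "dup":
            stack.append(stack[-1])
        else:
            stack.append(OPS[op](stack.pop()))
    return stack[-1]

def mutate(program, rate=0.3):
    """Point-mutate a program by randomly replacing ops, as a GA step might."""
    return [random.choice(list(OPS)) if random.random() < rate else op
            for op in program]

prog = ["flip_h", "rot90"]
print(run(prog, [[1, 2], [3, 4]]))  # [[4, 2], [3, 1]]
print(mutate(prog))
```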
| "Vision is an especially large weakness."
But you can have GPT write code to reliably convert the image grid into a textual representation, right? And code to convert back to an image and auto-verify.
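As a small sketch of what such a conversion plus auto-verification could look like (my illustration; the exact textual format is an assumption, not necessarily what the article used):

```python
def grid_to_text(grid):
    """Render a grid of small ints as lines of space-separated digits."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)

def text_to_grid(text):
    """Parse the textual form back into a list-of-lists grid."""
    return [[int(tok) for tok in line.split()] for line in text.splitlines()]

grid = [[0, 1, 2], [3, 4, 5]]
assert text_to_grid(grid_to_text(grid)) == grid  # round-trip auto-verification
```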
ARC-AGI is a small stepping stone to AGI, but it is not AGI.
Program search mimics what humans do to a certain extent, but not in its entirety. A more general world model and reference will be required for AGI.
The expectation is that you'll have to have dynamically generated benchmarks with better eval at some point, given the potential for brute-forcing the private validation set.
It is necessary but not sufficient.
If you can't do ARC, you aren't general enough. But even if you can do ARC, you still might not be general enough.
Ryan's work is legitimately interesting and novel "LLM reasoning" research! The core idea:
> get GPT-4o to generate around 8,000 python programs which attempt to implement the transformation, select a program which is right on all the examples (usually there are 3 examples), and then submit the output this function produces when applied to the additional test input(s)
Roughly, he's implemented an outer loop that uses 4o to sample reasoning traces/programs from the training examples and then tests them. Hybrid DL + program synthesis approaches are solutions we'd love to see more of.
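A compressed sketch of that outer loop (mine, not Ryan's actual code; `sample_candidate_programs` is a hypothetical stand-in for the GPT-4o sampling step):

```python
def solve(task, sample_candidate_programs, n_samples=8000):
    """Sample many candidate programs, keep one that reproduces every training
    pair, and apply it to the test input(s)."""
    train, test_inputs = task["train"], task["test_inputs"]
    for source in sample_candidate_programs(train, n=n_samples):
        namespace = {}
        try:
            exec(source, namespace)          # each candidate defines transform(grid)
            transform = namespace["transform"]
            if all(transform(x) == y for x, y in train):
                return [transform(x) for x in test_inputs]
        except Exception:
            continue                         # many sampled programs won't even run
    return None
```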
A couple important notes:
1. this result is on the public eval set vs private set (ARC Prize $).
2. the current private set SOTA ~35% solution also performed ~50% on the public set, so this new result might be SOTA but hasn't been validated or scrutinized yet.
All said, I do expect verified public set results to flow down to the private set over time. We'll be publishing all the SOTA scores and open source reproductions here once available: https://arcprize.org/leaderboard
EDIT: Also, congrats and kudos to Ryan for achieving this and putting the effort in to document and share his approach. We hope to inspire more frontier AI research sharing like this.