undefined

[-]

No, those aren't issues. But it's good to know the meaning of those numbers we get. For example, 25% is about the average human level (on this category of problems). 100% is either top human level or superhuman level or the information-theoretically optimal level.

by rolandhvar9 hours ago|

[-]

Sure, but, aim for the stars and you hit the moon right? Like fundamentally who cares? For the purpose of an AGI benchmark I'd argue you'd rather err on the side of being more intelligent and counting that as less intelligent than vice versa.

by girvo1 days ago|

[-]

Yeah I'm quite surprised as to how all of those are supposed to be considered problems. They all make sense to me if we're trying to judge whether these tools are AGI, no?

by andy12_1 days ago|

[-]

I think that any logic-based test that your average human can "fail" (aka, score below 50%) is not exactly testing for whether something is AGI or not. Though I suppose it depends on your definition of AGI (and whether all humans, or at least your average human, is considered AGI under that definition).

by chillfox23 hours ago|

[-]

If I had a puzzle I really needed solved, then I would not ask a rando on the street, I would ask someone I know is really good at puzzles.

My point is: For AGI to be useful, it really should be able to perform at the top 10% or better level for as many professions as possible (ideally all of them).

An AI that can only perform at the average human level is useless unless it can be trained for the job like humans can.

by versteegen23 hours ago|

[-]

> An AI that can only perform at the average human level is useless unless it can be trained for the job like humans can.

Yes, if you want skilled labour. But that's not at all what ARC-AGI attempts to test for: it's testing for general intelligence as possessed by anyone without a mental incapacity.

by sillysaurusx23 hours ago|

[-]

It seems they don't test for that, since they use the second-best human solution as a baseline.

And that's the right way to go. When computers were about to become superhuman at chess, few people cared that it could beat random people for many years prior to that. They cared when Kasparov was dethroned.

Remember, the point here is marketing as well as science. And the results speak for themselves. After all, you remember Deep Blue, and not the many runners-up that tried. The only reason you remember is because it beat Kasparov.

by swiftcoder16 hours ago|

[-]

> The only reason you remember is because it beat Kasparov

There is an additional fascinating aspect to these matches, in that Kasparov obviously knew he was facing a computer, and decided to play a number of sub-optimal openings because he hoped they might confound the computer's opening book.

It's not at all clear Deep Blue would have eked out the rematch victory had Kasparov respected it as an opponent, in the way he did various human grandmasters at the time.

by cubefox18 hours ago|

[-]

This is supposed to test for AGI, not ASI. ARC-AGI (later labelled "1") was supposed to detect AGI with a test that is easy for humans, not top humans.

by computably16 hours ago|

[-]

> Yes, if you want skilled labour. But that's not at all what ARC-AGI attempts to test for: it's testing for general intelligence as possessed by anyone without a mental incapacity.

Humans without a clinically recognized mental disability are generally capable of some kind of skilled labor. The "general" part of intelligence is independent of, but sufficient for, any such special application.

by 22 hours ago|

[-]

deleted

by benjaminl1 days ago|

[-]

This issue here is that people have different definitions of AGI. From the description. Getting 100% on this benchmark would be more than AGI and would qualify for ASI (Algorithmic Super Intelligence) not just AGI.

by fc417fc80223 hours ago|

[-]

If you only outdo humans 50% of the time you're never going to get consensus on if you've qualified. Whereas outdoing 90% of humans on 90% of all the most difficult tasks we could come up with is going to be difficult to argue against.

This benchmark is only one such task. After this one there's still the rest of that 90% to go.

Beating humans isn't anywhere near sufficient to qualify as ASI. That's an entirely different league with criteria that are even more vague.

by nearbuy17 hours ago|

[-]

Even dumb humans are considered to have general intelligence. If the bar is having to outdo the median human, then 50% of humans don't have general intelligence.

[-]

Not true. We don't have a good definition for intelligence - it's very much an I'll know it when I see it sort of thing.

Frontier models are reliably providing high undergraduate to low graduate level customized explanations of highly technical topics at this point. Yet I regularly catch them making errors that a human never would and which betray a fatal lack of any sort of mental model. What are we supposed to make of that?

It's an exceedingly weird situation we find ourselves in. These models can provide useful assistance to literal mathematicians yet simultaneously show clear evidence of lacking some sort of reasoning the details of which I find difficult to articulate. They also can't learn on the job whatsoever. Is that intelligence? Probably. But is it general? I don't think so, at least not in the sense that "AGI" implies to me.

Once humanity runs out of examples that reliably trip them up I'll agree that they're "general" to the same extent that humans are regardless of if we've figured out the secrets behind things such as cohesive world models, self awareness, active learning during operation, and theory of mind.

by nearbuy7 hours ago|

[-]

> Not true.

It's certainly true. By definition. If the bar for general intelligence is being smarter than the median human, 50% of people won't reach the threshold for general intelligence. (And if the bar is beating the median in every cognitive test, then a much smaller fraction of people would qualify.)

People don't have a consistent definition of AGI, and the definitions have changed over the past couple years, but I think most people have settled on it meaning at least as smart as humans in every cognitive area. But that has to be compared to dumb people, not median. We don't want to say that regular people don't have general intelligence.

by tadfisher6 hours ago|

[-]

You are using terms like "smart" and "dumb" as if they have universally-accepted definitions. You can make up as many definitions of intelligence as you like (I would argue that is a sign of intelligence) but using those terms is certainly going to lead to circular reasoning.

by LordDragonfang16 hours ago|

[-]

> Yet I regularly catch them making errors that a human never would

I have yet to see a "error" that modern frontier models make that I could not imagine a human making - average humans are way more error prone than the kind of person who posts here thinks, because the social sorting effects of intelligence are so strong you almost never actually interact with people more than a half standard deviation away. (The one exception is errors in spatial reasoning with things humans are intimately familiar with - for example, clothing - because LLMs live in literary space, not physics space, and only know about these things secondhand)

> and which betray a fatal lack of any sort of mental model.

This has not been a remotely credible claim for at least the past six months, and it seemed obviously untrue for probably a year before then. They clearly do have a mental model of things, it's just not one that maps cleanly to the model of a human who lives in 3D space. In fact, their model of how humans interact is so good that you forget that you're talking to something that has to infer rather than intuit how the physical world works, and then attribute failures of that model to not having one.

by throwaway0123_51 hours ago|

[-]

> I have yet to see a "error" that modern frontier models make that I could not imagine a human making

I mostly agree if "a human" is just any person we pluck of the street. What I still see with some regularity is the models (right now, primarily Opus 4.6 through Claude Code) making mistakes that humans:

- working in the same field/area as me (nothing particularly exotic, subfield of CS, not theory)

- with even a fraction of the declarative knowledge about the field as the LLM

- with even a fraction of frontier LLM abilities suggested by their perf in mathematical/informatics Olympiads

would never make. Basically, errors I'd never expect to see from a human coworker (or myself). I don't yet consider myself an expert in my subfield, and I'll almost certainly never be a top expert in it. Often the errors seem to present to me as just "really atrocious intuition." If the LLM ran with some of them they would cause huge problems.

In many regards the models are clearly superhuman already.

by fc417fc80210 hours ago|

[-]

> you almost never actually interact with people more than a half standard deviation away

I wasn't talking about the average person there but rather those who could also craft the high undergrad to low grad level explanations I referred to.

> This has not been a remotely credible claim for at least the past six months

Well it's happened to me within the past six months (actually within the past month) so I don't know what you want from me. I wasn't claiming that they never exhibit evidence of a mental model (can't prove a negative anyhow). There are cases where they have rendered a detailed explanation to me yet there were issues with it that you simply could not make if you had a working mental model of the subject that matched the level of the explanation provided (IMO obviously). Imagine a toddler spewing a quantum mechanics textbook at you but then uttering something completely absurd that reveals an inherent lack of understanding; not a minor slip up but a fundamental lack of comprehension. Like I said it's really weird and I'm not sure what to make of it nor how to properly articulate the details.

I'm aware it's not a rigorous claim. I have no idea how you'd go about characterizing the phenomenon.

by winwang5 hours ago|

[-]

How much of this is expectations setting by the heights models reach? i.e. of we could assess a consistent floor of model performance in a vacuum, would we say it's better at "AGI" than the bottom 0.1% of humans?

by fc417fc80256 minutes ago|

[-]

Not sure how to answer because we were off on a tangent there about mental models.

I think AGI is two things. Intelligence at a given task, which can be scored versus humans or otherwise. And generalization which is entirely separate. We already have superhuman non-general models in a few domains.

So I don't think that "better than AGI at % of humans" is a sensible statement, at least not initially.

Right now humans generalize to all integers while AI companies keep manually adding additional integers to a finite list and bystanders make claims of generality. If you've still got a finite list you aren't general regardless of how long the list is.

If at some point a model shows up that works on all even integers but not odd ones then I guess you could reasonably claim you had AGI that was 50% of what humans achieve. If a model that generalizes to all the reals shows up then it will have exceeded human generality by an infinite degree. We'll cross those bridges when we come to them - I don't think we're there yet.

by winwang46 minutes ago|

[-]

Interestingly, I find that the models generalize decently well as long as the "training" (more analogous to that for humans) fits in (small enough) context. That's to say, "in-context learning" seems good enough for real use.

But of course, that's not quite "long term"

by charcircuit16 hours ago|

[-]

I think you are getting caught up on the intelligence part. That is the easy part since AGI doesn't have to be intelligent, it just has to be intelligence. If you look at early chess AI you will see that they are very weak compared to even a beginner human. The level of intelligence does not matter for a chess bot to be considered AI. It is that it is emulating intelligence that makes it AI.

>But is it general? I don't think so

I would consider it as general due to me being able to take any problem I can think of and the AI will make an attempt to solve it. Actually solving it is not a requirement for AGI. Being able to solve it just makes it smarter than an AGI that can't. You can trip up chess AI, but that don't stop them from being AI. So why apply that standard to AGI?

by fc417fc80210 hours ago|

[-]

How am I getting caught up on it? I acknowledged that I think frontier models qualify as intelligent but disputed the "general" part. In fact for quite a few years now there have been many non-frontier models that I also consider intelligent within a very narrow domain.

I think stockfish reasonably qualifies as superhuman AI but not even remotely "general". Similarly alphafold.

> Actually solving it is not a requirement for AGI.

I think I see what you're trying to get at but taken as worded that can't possibly be right. Otherwise a dumb-as-a-brick automaton that made an "attempt" to tackle whatever you put in front of it would qualify as AGI.

by charcircuit6 hours ago|

[-]

>Otherwise a dumb-as-a-brick automaton that made an "attempt" to tackle whatever you put in front of it would qualify as AGI.

I would agree as long as there is a general mechanism to represent problems. It is AGI, but would perform poorly on benchmarks compared to better AGI.

by foltik23 hours ago|

[-]

I’d be hesitant to call that ASI if it’s pretty obvious how you’d write a regular old program to solve it.

by nopinsight18 hours ago|

[-]

It’s not that simple since each problem is supposed to be distinct and different enough that no single program can solve multiple of them properly. No problem spec is provided as well iiuc so you can’t simply ask an LLM to generate code without doing other things.

[-]

A human can sit down to play a game with unknown rules and write a spec as he goes. If a model can't even figure out to attempt that, let alone succeed at it, then it most certainly isn't an example of "general" intelligence.

by LordDragonfang15 hours ago|

[-]

> A human can sit down to play a game with unknown rules and write a spec as he goes.

Some humans can. Many, if not most humans cannot. A significant enough fraction of humans have trouble putting together Ikea furniture that there are memes about its difficulty. You're vastly overestimating the capabilities of the average human. Working in tech puts you in probably the top ~1-5% of capability to intuit and understand rules, but it distorts your intuition of what a "reasonable" baseline for that is.

by fc417fc80210 hours ago|

[-]

Yes, I am aware. However an idealized human can do so. Analogously, there are plenty of humans that can't run an 8 minute mile but if your bipedal robot is physically incapable of ever doing that then it isn't reasonable to claim having achieved human level athletic performance. When it can compete in every Olympic event you can claim human level performance at athletics in general.

If the model can't generalize to arbitrary tasks on its own without any assistance then it doesn't qualify as a general intelligence. AGI to my mind means meeting or exceeding idealized human performance on the vast majority of arbitrary tasks that are cherrypicked to be particularly challenging.

by cubefox18 hours ago|

[-]

It's not obvious at all. And I would say pretty much impossible without using machine learning. Even for ARC-AGI-1 there is no GOFAI program that scores high.

by 18 hours ago|

[-]

deleted

by throwuxiytayq23 hours ago|

[-]

People are still debating whether these models exhibit any kind of intelligence and any kind of thinking. Setting the bar higher then necessary is welcome, but at this point I’m pretty sure everyone’s opinions are set in stone.

by iLoveOncall14 hours ago|

https://web.archive.org/web/20150108000749/https://en.wikipe...

[-]

There's a single true definition of AGI, open the page about AGI on Wikipedia but using archive.org on a snapshot from 10 years ago.

All the rest is bullshit made up by LLM labs to make it seem like they hit AGI by dumbing down its definition.

by LordDragonfang15 hours ago|

https://x.com/mgubrud/status/2036262415634153624

[-]

In retrospect, it seems obvious that we hit AGI by a reasonable "at least as intelligent as some humans" definition when o3 came out, and everything since then has been goalpost moving by people who have higher and higher bars for which percentile human they would be willing to employ (or consider intellectually capable). People should really just use the term "ASI" when their definition of AGI excludes the majority of humans.

Edit: Here's the guy who coined the term saying we're already there. Everything else is arguing over definitions.

> Well, Lars, I INVENTED THE TERM and I say we have achieved AGI. Current models perform at roughly high-human level in command of language and general knowledge, but work thousands of times faster than us. Still some major deficiencies remain but they're falling fast.

by naasking10 hours ago|

[-]

> They all make sense to me if we're trying to judge whether these tools are AGI, no?

As long as the mean and median human scores are clearly communicated, the scoring is fine. I think the human scores above would surprise people at first glance, even if they make sense once you think about it, so there's an argument to be made that scores can be misleading.

by stingraycharles14 hours ago|

[-]

“no harnass at all” might be an issue, though, as these types of benchmarks are often gamified and then models perform great on them without actually being better models.

by stonogo22 hours ago|

[-]

They are severe problems if your income is tied to LLM hype generation.

by NitpickLawyer1 days ago|

[-]

> No harness at all and very simplistic prompt

TBF, that's basically what the kaggle competition is for. Take whatever they do, plug in a SotA LLM and it should do better than whatever people can do with limited GPUs and open models.

by bee_rider23 hours ago|

[-]

Defining the baseline human is always a bit arbitrary. The median human is illiterate and also dead.

by esailija17 hours ago|

[-]

It actually makes sense. For any task it is completely trivial for anyone to become better than >80% humans and still easy to be better than >95%. The only problem is motivation not intelligence.

by codeinred5 hours ago|

[-]

We're at the point where LLMs and coding agents are supposed to do higher-level work. It makes sense to benchmark them against top human performance, rather than average human performance, because at specialized tasks, average human performance isn't enough.

The issues you described seem like they're actually strengths of the benchmark.

by fchollet1 days ago|

[-]

Francois here. The scoring metric design choices are detailed in the technical report: https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf - the metric is meant to discount brute-force attempts and to reward solving harder levels instead of the tutorial levels. The formula is inspired by the SPL metric from robotics navigation, it's pretty standard, not a brand new thing.

We tested ~500 humans over 90 minute sessions in SF, with $115-$140 show up fee (then +$5/game solved). A large fraction of testers were unemployed or under-employed. It's not like we tested Stanford grad students. Many AI benchmarks use experts with Ph.D.s as their baseline -- we hire regular folks as our testers.

Each game was seen by 10 people. They were fully solved (all levels cleared) by 2-8 of them, most of the time 5+. Our human baseline is the second best action count, which is considerably less than an optimal first-play (even the #1 human action count is much less than optimal). It is very achievable, and most people on this board would significantly outperform it.

Try the games yourself if you want to get a sense of the difficulty.

> Models can't use more than 5X the steps that a human used

These aren't "steps" but in-game actions. The model can use as much compute or tools as it wants behind the API. Given that models are scored on efficiency compared to humans, the cutoff makes basically no difference on the final score. The cutoff only exists because these runs are incredibly expensive.

> No harness at all and very simplistic prompt

This is explained in the paper. Quoting: "We see general intelligence as the ability to deal with problems that the system was not specifically designed or trained for. This means that the official leaderboard will seek to discount score increases that come from direct targeting of ARC-AGI-3, to the extent possible."

...

"We know that by injecting a high amount of human instructions into a harness, or even hand-crafting harness configuration choices such as which tools to use, it is possible to artificially increase performance on ARC-AGI-3 (without improving performance on any other domain). The purpose of ARC-AGI-3 is not to measure the amount of human intelligence that went into designing an ARC-AGI-3 specific system, but rather to measure the general intelligence of frontier AI systems.

...

"Therefore, we will focus on reporting the performance of systems that have not been specially prepared for ARC-AGI-3, served behind a general-purpose API (representing developer-aware generalization on a new domain as per (8)). This is similar to looking at the performance of a human test-taker walking into our testing center for the first time, with no prior knowledge of ARC-AGI-3. We know such test takers can indeed solve ARC-AGI-3 environments upon first contact, without prior training, without being briefed on solving strategies, and without using external tools."

If it's AGI, it doesn't need human intervention to adapt to a new task. If a harness is needed, it can make its own. If tools are needed, it can chose to bring out these tools.

by Imnimo1 days ago|

[-]

Suppose you construct a Mechanical Turk AI who plays ARC-AGI-3 by, for each task, randomly selecting one of the human players who attempted it, and scoring them as an AI taking those same actions would be scored. What score does this Turk get? It must be <100% since sometimes the random human will take more steps than the second best, but without knowing whether it's 90% or 50% it's very hard for me to contextualize AI scores on this benchmark.

by thereitgoes45619 hours ago|

[-]

The people recruited weren’t experts. I can imagine it’s straightforward to find humans (such as those that play many video games) that can score >100% on this benchmark.

by Imnimo19 hours ago|

[-]

So, if you look at the way the scoring works, 100% is the max. For each task, you get full credit if you solve in a number of steps less than or equal to the baseline. If you solve it with more steps, you get points off. But each task is scored independently, and you can't "make up" for solving one slowly by solving another quickly.

Like suppose there were only two tasks, each with a baseline score of solving in 100 steps. You come along and you solve one in only 50 steps, and the other in 200 steps. You might hope that since you solved one twice as quickly as the baseline, but the other twice as slowly, those would balance out and you'd get full credit. Instead, your scores are 1.0 for the first task, and 0.25 (scoring is quadratic) for the second task, and your total benchmark score is a mere 0.625.

[-]

The purpose is to benchmark both generality and intelligence. "Making up for" a poor score on one test with an excellent score on another would be the opposite of generality. There's a ceiling based on how consistent the performance is across all tasks.

[-]

Thanks, I mostly agree with your approach except for one thing: eyesight feels like a "harness" that humans get to use and LLMs do not.

I'm guessing you did not pass the human testers JSON blobs to work with, and suspect they would also score 0% without the eyesight and visual cortex harness to their reasoning ability.

by fchollet1 days ago|

[-]

I'm all for testing humans and AI on a fair basis; how about we restrict testing to robots physically coming to our testing center to solve the environments via keyboard / mouse / screen like our human testers? ;-)

(This version of the benchmark would be several orders of magnitude harder wrt current capabilities...)

by stalfie15 hours ago|

[-]

This counterpoint doesn't address the issue, and I would argue that it is partially bad faith.

Yes, making it to the test center is significantly harder, but in fact the humans could have solved it from their home PC instead, and performed the exact same. However, if they were given the same test as the LLMs, forbidden from input beyond JSON, they would have failed. And although buying robots to do the test is unfeasible, giving LLMs a screenshot is easy.

Without visual input for LLMs in a benchmark that humans are asked to solve visually, you are not comparing apples to apples. In fact, LLMs are given a different and significantly harder task, and in a benchmark that is so heavily weighted against the top human baseline, the benchmark starts to mean something extremely different. Essentially, if LLMs eventually match human performance on this benchmark, this will mean that they in fact exceed human performance by some unknown factor, seeing as human JSON performance is not measured.

Personally, this hugely decreased my enthusiasm for the benchmark. If your benchmark is to be a North star to AGI, labs should not be steered towards optimizing superhuman JSON parsing skills. It is much more interesting to steer them towards visual understanding, which is what will actually lead the models out into the world.

by stalfie12 hours ago|

[-]

I just realized that this also means that the benchmark is in practice unverified by third parties, as all tasks are not verified to be solvable through the JSON interface. Essentially there is no guarantee that it is even possible to understand how to complete every task optimally through the JSON interface alone.

I assume you did not develop the puzzles by visualizing JSON yourselves, and so there might be non obvious information that is lost in translation to JSON. Until humans optimally solve all the puzzles without ever having seen the visual version, there is no guarantee that this is even possible to do.

I think the only viable solution here is to release a version of the benchmark with a vision only harness. Otherwise it is impossible to interpret what LLM progress on this benchmark actually means.

by stalfie3 hours ago|

[-]

Oookay. I actually tried the harness myself, and there was a visual option. It is unclear to me if that is what the models are using on the official benchmark, but it probably is. This probably means that much of my critique is invalid. However, in the process of fiddling with the harness, building a live viewer to see what was happening, and playing through the agent API myself, I might have found 3-4 bugs with the default harness/API. Dunno where to post it, so of all places I am documenting the process on HN.

Bug 1: The visual mode "diff" image is always black, even if the model clicked on an interactive element and there was a change. Codex fixed it in one shot, the problem was in the main session loop at agent.py (line 458).

Bug 2: Claude and Chatgpt can't see the 128x128 pixel images clearly, and cannot or accurately place clicks on them either. Scaling up the images to 1028x1028 pixels gave the best results, claude dropped off hard at 2048 for some reason. Here are the full test results when models were asked to hit specific (manually labeled) elements on the "vc 33" level 1 (upper blue square, lower blue square, upper yellow rectangle, lower yellow rectangle):

Model | 128 | 256 | 512 | 1024 | 2048

claude-opus-4-6 | 1/10 | 1/10 | 9/10 | 10/10 | 0/10

gemini-3-1-pro-preview | 10/10 | 10/10 | 10/10 | 10/10 | 10/10

gpt-5.4-medium | 4/10 | 8/10 | 9/10 | 10/10 | 8/10

Bug 3: "vc 33" level 4 is impossible to complete via the API. At least it was when I made a web-viewer to navigate the games from the API side. The "canal lock" required two clicks instead of one to transfer the "boat" when water level were equilibriated, and after that any action whatsoever would spontaneously pop the boat back to the first column, so you could never progress.

"Bug" 4: This is more of a complaint on the models behalf. A major issue is that the models never get to know where they clicked. This is truly a bit unfair since humans get a live update of the position of their cursor at no extra cost (even a preview of the square their cursor highlights in the human version), but models if models fuck up on the coordinates they often think they hit their intended targets even though they whiffed the coordinates. So if that happens they note down "I hit the blue square but I guess nothing happened", and for the rest of the run they are fucked because they conclude the element is not interactive even though they got it right on the first try. The combination of an intermediary harness layer that let the models "preview" their cursor position before the "confirmed" their action and the 1024x1024 resolution caused a major improvement in their intended action "I want to click the blue square" actually resulting in that action. However, even then unintended miss-clicks often spell the end of a run (Claude 4.6 made it the furthest, which means level 2 of the "vc 33" stages, and got stuck when it missed a button and spent too much time hitting other things)

After I tried to fix all of the above issues, and tried to set up an optimal environment for models to get a fair shake, the models still mostly did very badly even when they identified the right interactive elements...except for Claude 4.6 Opus! Claude had at least one run where it made it to level 4 on "vc 33", but then got stuck because the blue squares it had to hit became too small, and it just couldn't get the cursor in the right spot even with the cursor preview functionality (the guiding pixel likely became too small for it to see clearly). When you read through the reasoning for the previous stages though, it didn't truly fully understand the underlying logic of the game, although it was almost there.

[-]

Well, yes, and would hand even more of an advantage to humans. My point is that designing a test around human advantages seems odd and orthogonal to measuring AGI.

by adgjlsfhk11 days ago|

[-]

The whole point of AGI is "general" intelligence, and for that intelligence to be broadly useful it needs to exist within the context of a human centric world

by Marha0118 hours ago|

[-]

Does this mean blind people are not intelligent?

by degamad16 hours ago|

[-]

Blind people do function within the context of a human-centric world, though, so they would qualify as intelligent.

by Marha018 hours ago|

[-]

Yes, but they use various "harnesses" to do so (dog guides, text to speech software, assistance of other humans when needed..). Why can't AI?

by Rastonbury1 hours ago|

[-]

Assistance of other humans? You do realise we're talking about an intelligence test right, at that point what are you even testing for. I'm sure you've taken exams where you couldn't bring your own notes, use Google or get help from someone, even though real life doesn't have those constraints

[-]

Then why deny it a harness it can also use in a human centric world?

by getnormality23 hours ago|

[-]

There is no general purpose harness.

by scotty791 days ago|

[-]

General intelligence not owning retinas.

Denying proper eyesight harness is like trying to construct speech-to-text model that makes transcripts from air pressure values measured 16k times per second, while human ear does frequency-power measurement and frequency binning due to it's physical construction.

by fc417fc8021 days ago|

[-]

The human testers were provided with their customary inputs, as were the LLMs. I don't see the issue.

I guess it could be interesting to provide alternative versions that made available various representations of the same data. Still, I'd expect any AGI to be capable of ingesting more or less any plaintext representation interchangeably.

[-]

The issue is that ARC AGI 3 specifically forbids harnesses that humans get to use.

by 1 days ago|

[-]

deleted

by fc417fc80223 hours ago|

[-]

So what? Are you suggesting that an agent exhibiting genuine AGI will be tripped up by having to ingest json rather than rgb pixels? LLMs are largely trained on textual data so json is going to be much closer to whatever native is for them.

But by all means, give the agents access to an API that returns pixel data. However I fully expect that would reduce performance rather than increase it.

[-]

Because it is. Opus 4.6 jumps from 0.0% to 97.1% when given visual input

[-]

That's impressive. I'm also a bit surprised - I wouldn't have expected it to be trained much at all on that sort of visual input task. I think I'd be similarly surprised to learn that a frontier model was particularly good at playing retro videogames or actuating a robot for example.

However, if it can't figure out to render the json to a visual on its own does it really qualify as AGI? I'd still say the benchmark is doing its job here. Granted it's not a perfectly even playing field in that case but I think the goal is to test for progress towards AGI as opposed to hosting a fair tournament.

by rfoo16 hours ago|

[-]

> However, if it can't figure out to render the json to a visual on its own does it really qualify as AGI? I'd still say the benchmark is doing its job here.

Can you render serialized JSON text blob to a visual with your brain only? The model can't do anything better than this - no harness means no tool at all, no way to e.g. implement a visualizer in whatever programming language and run it.

Why don't human testers receive the same JSON text blob and no visualizers? It's like giving human testers a harness (a playable visualizer), but deliberately cripples it for the model.

[-]

Huh. I thought it wasn't supposed to receive any instructions tailored to the task but I didn't understand it to be restricted from accessing truly general tools such as programming languages. To do otherwise is to require pointless hoop jumping as frontier models inevitably get retrained to play games using a json (or other arbitrary) representation at which point it will be natural for them and the real test will begin.

by frotaur10 hours ago|

[-]

This is my understanding as well, I thought tools where allowed.

[-]

Source? I haven't seen anything like that for ARC-AGI performance.

Also, if it makes that big of a difference, then make a renderer for your agent that looks like the web page and have it solve them in the graphical interface and funnel the results to the API. I guarantee you won't get better performance, because the AGI is going to have to "understand" the raw data can be represented as a 2D matrix regardless of whether it gets a 2D matrix of pixels or a 2D matrix of enumeration in JSON. If anything, that makes it a more difficult problem for a AI system that "speaks" in tokens.

by famouswaffles7 hours ago|

[1] https://arcprize.org/media/ARC_AGI_3_Technical_Report.pdf

[-]

That score is in the arc technical paper [1]. It's the full benchmark score using this harness [2] (which is just open code with read, grep, bash tools).

This is already a solved benchmark. That's why scoring is so convoluted and a self proclaimed Agent benchmark won't allow basic agent tools. ARC has always been a bit of a nothing burger of a benchmark but this takes the cake.

[2] https://blog.alexisfox.dev/arcagi3

by vbarrielle7 hours ago|

[-]

> For example, in a variant of environment TR87, Opus 4.6 scores 0.0% with no harness and 97.1% with the Duke harness (12), yet in environment BP35, Opus 4.6 scores 0.0% under both configuration

This is with a harness that has been designed to tackle "a small set of public environments: ls20, ft09, and vc33" (of the arc-agi-3 challenge), yet it looks like it does not solve the full arc-agi-3 benchmark, just some of them.

by famouswaffles7 hours ago|

[-]

The harness was designed with the preview, but no it was still tested on the full public set in that environment. You can run the benchmark in different 'environments' though it's unclear what the difference between them is.

>We then tested the harnesses on the full public set (which researchers did not have access to at the time)

by daveguy2 hours ago|

+ https://arxiv.org/abs/1911.01547

[-]

It may have been tested on the full set, but the score you quote is for a single game environment. Not the full public set. That fact is verbatim in what you responded to and vbarrielle quoted. It scored 97% in one game, and 0% in another game. The full prelude to what vbarrielle quoted, the last sentence of which you left out, was:

> We then tested the harnesses on the full public set (which researchers did not have access to at the time). We found extreme bimodal performance across the two sets, controlling for the same frontier model...

The harness only transfers to like-environments and the intelligence for those specific games is baked into the harness by the humans who coded it for this specific challenge.

The point of ARC-AGI is to test the intelligence of AI systems in novel, but simple, environments. Having a human give it more powerful tools in a harness defeats the purpose. You should go back and read the original ARC-AGI paper to see what this is about+. Are you upset about the benchmark because frontier LLM models do so poorly exhibiting the ability to generalize when the benchmarks are released?

by fc417fc8021 hours ago|

[-]

> intelligence for those specific games is baked into the harness

This is your claim but the other commenter claims the harness consists only of generic tools. What's the reality?

I also encountered confusion about this exact issue in another subthread. I had thought that generic tooling was allowed but others believed the benchmark to be limited to ingesting the raw text directly from the API without access to any agent environment however generic it might be.

by levocardia20 hours ago|

[-]

My sense is that a powerful enough AI would have the sense to think something like "ah, this sounds like a video game! Let me code up an interactive GUI, test it for myself, then use it to solve these puzzles..." and essentially self-harness (the way you would if you were reading a geometry problem, by drawing it out on paper).

[-]

Yeah but thats literally above ASI, let alone AGI. Average human scores <1% on this bench, opus scores 97.1% when given an actual vision access, which means agi was long ago achieved

by vova_hn213 hours ago|

[-]

> opus scores 97.1% when given an actual vision access

Do you have a source for this? I would be very curious to see how top models do with vision.

by famouswaffles7 hours ago|

https://news.ycombinator.com/item?id=47532483

[-]

[-]

No, there is no source for this. Opus is scoring around 1% just like all the other frontier models. It would be fairly trivial to add a renderer intermediary. And if it improves to 97+%... Then you would get a huge cut of $2 million dollars. The assertion that Opus gets 97% if you just give it a gui is completely bogus.

by 1 days ago|

[-]

deleted

by blueblisters1 days ago|

[-]

I tried ls20 and it was surprisingly fun! Just from a game design POV, these are very well made.

Nit: I didn't see a final score of how many actions I took to complete 7 levels. Also didn't see a place to sign in to see the leaderboard (I did see the sign in prompt).

by TarasBob17 hours ago|

[-]

Agree 100%. I want to be able to see how many actions it took me. And it would be good if it were possible to see how well I'm doing compared to other humans, i.e. what is my percentile.

by sanxiyn18 hours ago|

[-]

While I think all of your design choices are defensible, I do think you should release the full human baseline data. The second best action count is fine, but other choices are reasonable as well.

by antirez11 hours ago|

[-]

> If a harness is needed, it can make its own. If tools are needed, it can chose to bring out these tools.

If I understand correctly the model can carry only very limited memory among tests, so it looks like it's not really possible for the model to self specialize itself under this assumptions.

by Veedrac23 hours ago|

[-]

There's a very simple solution to this problem here. Instead of wink-wink-nudge-nudge implying that 100% is 'human baseline', calculate the median human score from the data you already have and put it on that chart.

[-]

Its below 1% lmao

by chotmat16 hours ago|

[-]

where did you get this 1%?

by strongpigeon1 days ago|

[-]

Something that I don't understand after reading the technical report is: Why is having access to a python interpreter as part of the harness not allowed (like the Duke harness), but using one hidden behind the model API (as a built-in tool) considered kosher?

by cdetrio1 days ago|

[-]

The Duke harness was specifically designed for these puzzles, that's why they don't want to measure it.

My reading of that part in the technical report (models "could be using their own tools behind the model’s API, which is a blackbox"), is that there's no way to prevent it.

But from fchollet's comment here, using tools and harnesses is encouraged, as long as they are generic and not arc-agi specific. In that case, the models should be benchmarked by prompting through claude code and codex, rather than the through API (as from the api we only expect raw LLM output, and no tool use).

by FINDarkside1 days ago|

[-]

OpenAi does have python execution behind general purpose api, but it has to be enabled with a flag so I don't think it was used.

by Marha0118 hours ago|

[-]

Don't you see the massive problem with requiring visual input? Are blind people not intelligent because they cannot solve ARC-AGI-3 without a "harness"?

A theoretical text-only superintelligent LLM could prove the Riemann hypothesis but fail ARC-AGI-3 and won't even be AGI according to this benchmark...

by notnullorvoid5 hours ago|

[-]

Think of it as spatial input, not visual. Blind people do have spatial inputs, and high spatial intelligence.

by cubefox18 hours ago|

[-]

Well, it would be AGI if you could connect a camera to it to solve it, similar to how blind people would be able to solve it if you restored their eyesight. But if the lack of vision is a fundamental limitation of their architecture, then it seems more fair not to call them AGI.

by Marha0117 hours ago|

[-]

People blind from birth literally lack the neural circuits to comprehend visual data. Are they not intelligent?

by maldev8 hours ago|

[-]

I think I can confidently say they are not visually intelligent at all.

If you were phrasing things to quantify intelligence, you would have a visual intelligence pillar. And they would not pass that pillar. It doesn't make them dysfunctional or stupid, but visual intelligence is a key part of human intelligence.

by notnullorvoid5 hours ago|

[-]

Visual intelligence is a near meaningless term as it's almost entirely dependant on spatial intelligence. The visually impaired do have high spatial intelligence, I wouldn't be surprised if their spatial intelligence is actually higher on average than those without visual impairment.

by cubefox16 hours ago|

[-]

I think they don't actually lack them, or lack only a small fraction (their brains are ≈99% like a normal human brain), such that if they were an AI model, they could be fairly trivially upgraded with vision capability.

by WarmWash1 days ago|

[-]

Maybe this is a neither can confirm or deny thing, but are there systems in place or design decisions made that are meant to surface attempts at benchmark optimizing (benchmaxxing), outside of just having private sets? Something like a heuristic anti-cheat I suppose.

Or perhaps the view is that any gains are good gains? Like studying for a test by leaning on brute memorization is still a non-zero positive gain.

by fchollet1 days ago|

[-]

There are no tricks. Our approach to reducing the impact of targeting (without fully eliminating it) is described in the paper.

by cdetrio1 days ago|

[-]

Are you prompting the models through their APIs, which are not designed to use tools or harnesses? Or do the "system prompt" results come from prompting into the applications (i.e. claude code, or codex, or even the web front-ends)?

by GodelNumbering1 days ago|

[-]

Off topic but I have been following your Twitter for a while and your posts specifically about the nature of intelligence have been a read.

[-]

New benchmark idea: 20 questions of guess the number 1-10, with different answers. We run this on 10,000 humans, take best score. Then we take 50 ai attempts, but take the worst attempt so "worst case scenarior robustness or so". We also discard questions where human failed but ai passed because uhhh reasons... Then we also take the final relative score to the power of 100 so that the benchmark punishes bad answers or sum. Good benchmark?

[-]

This is a gross misrepresentation of the scoring process.

by dyauspitr1 days ago|

[-]

If anything this makes the test much harder for the LLM to get high scores and that makes the scores they’re getting all that much more impressive.

by daveguy10 hours ago|

[-]

The scroes they're getting are on the order of 0-1% for this ARC-AGI-3 benchmark.

by Marazan1 days ago|

[-]

"Very simplistic prompt" is the absolute and total core of this and the thing that ensures validity of the whole exercise.

If you are trying to measure GENERAL intelligence then it needs to be general.

by littlestymaar15 hours ago|

[-]

Like other ARC-AGI challenges it was never needed to reach 100% to get human-level. The benchmark score is stretched so that the benchmark takes more time to be saturated, that's it.

The current SotA models are still very far from your hypothetical “average human” with a score of 3%. So the benchmark is indeed useful to help the field progress (which is the entire point of ARC-AGI benchmarks).

by theLiminator1 days ago|

[-]

Lol basically we're saying AI isn't AI if we utilize the strength of computers (being able to compute). There's no reason why AGI should have to be as "sample efficient" as humans if it can achieve the same result in less time.

by pptr1 days ago|

[-]

Let's say an agent needs to do 10 brain surgeries on a human to remove a tumor and a human doctor can do it in a single surgery. I would prefer the human.

"steps" are important to optimize if they have negative externalities.

by ACCount371 days ago|

[-]

It's kind of the point? To test AI where it's weak instead of where it's strong.

"Sample efficient rule inference where AI gets to control the sampling" seems like a good capability to have. Would be useful for science, for example. I'm more concerned by its overreliance on humanlike spatial priors, really.

by famouswaffles1 days ago|

[-]

ARC has always had that problem but for this round, the score is just too convoluted to be meaningful. I want to know how well the models can solve the problem. I may want to know how 'efficient' they are, but really I don't care if they're solving it in reasonable clock time and/or cost. I certainly do not want them jumbled into one messy convoluted score.

'Reasoning steps' here is just arbitrary and meaningless. Not only is there no utility to it unlike the above 2 but it's just incredibly silly to me to think we should be directly comparing something like that with entities operating in wildly different substrates.

If I can't look at the score and immediately get a good idea of where things stand, then throw it way. 5% here could mean anything from 'solving only a tiny fraction of problems' to "solving everything correctly but with more 'reasoning steps' than the best human scores." Literally wildly different implications. What use is a score like that ?

by pants21 days ago|

[-]

The measurement metric is in-game steps. Unlimited reasoning between steps is fine.

This makes sense to me. Most actions have some cost associated, and as another poster stated it's not interesting to let models brute-force a solution with millions of steps.

by famouswaffles1 days ago|

[-]

Same thing in this case. No Utility and just as arbitrary. None of the issues with the score change.

Models do not brute force solutions in that manner. If they did, we'd wait the lifetimes of several universes before we could expect a significant result.

Regardless, since there's a x5 step cuttof, 'brute forcing with millions of steps' was never on the table.

by thereitgoes4561 days ago|

[-]

The metric is very similar to cost. It seems odd to justify one and not the other.

by famouswaffles1 days ago|

[-]

Cost has utility in the real world and this doesn't. That's the only reason i would tolerate thinking about cost, and even then, i would never bundle it into the same score as the intelligence, because that's just silly.

by jstummbillig1 days ago|

[-]

It's an interesting point but I too find it questionable. Humans operate differently than machines. We don't design CPU benchmarks around how humans would approach a given computation. It's not entirely obvious why we would do it here (but it might still be a good idea, I am curious).

by cyanydeez1 days ago|

[-]

I think your logic isn't sound: Wouldn't we want a "intelligence" to solve problems efficiently rather than brute force a million monkies? There's defnitely a limit to compute, the same ways there's a limit to how much oil we can use, etc.

In theory, sure, if I can throw a million monkies and ramble into a problem solution, it doesnt matter how I got there. In practice though, every attempt has a direct and indirect impact on the externalities. You can argue those externalities are minor, but the largesse of money going to data centers suggests otherwise.

Lastly, humans use way less energy to solve these in fewer steps, so of course it matter when you throw Killowatts at something that takes milliwatts to solve.

by diego_sandoval1 days ago|

[-]

> Lastly, humans use way less energy to solve these in fewer steps,

Not if you count all the energy that was necessary to feed, shelter and keep the the human at his preferred temperature so that he can sit in front of a computer and solve the problem.

by cyanydeez1 days ago|

[-]

ok, but thats the same for bulding a data center.

Try again.

by gunalx1 days ago|

[-]

Yes, especially when considering a dataceter needed the energy of pretty many people to be built.

A single human is indeed more efficent, and way more flexible and actually just general intelligence.

by fsdf21 days ago|