
[-]

Per BalatroBench, gemini-3-pro-preview makes it to round (not ante) 19.3 ± 6.8 on the lowest difficulty on the deck aimed at new players. Round 24 is ante 8's final round. Per BalatroBench, this includes giving the LLM a strategy guide, which first-time players do not have. Gemini isn't even emitting legal moves 100% of the time.
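For anyone who wants to check the round/ante arithmetic, here is a minimal sketch. It assumes every ante has exactly 3 rounds (small blind, big blind, boss blind) and that no blinds are skipped; the constant and function names are made up for illustration and are not part of BalatroBench.

```python
import math

# Assumption: 3 rounds per ante (small, big, boss blind), no skipped blinds.
ROUNDS_PER_ANTE = 3


def ante_for_round(round_number: float) -> int:
    """Return the ante that contains the given (possibly averaged) round."""
    return math.ceil(round_number / ROUNDS_PER_ANTE)


def final_round_of_ante(ante: int) -> int:
    """Return the round number of the boss blind for the given ante."""
    return ante * ROUNDS_PER_ANTE


# Figures quoted in the comment above.
mean_round, spread = 19.3, 6.8

print(final_round_of_ante(8))      # last round of ante 8
print(ante_for_round(mean_round))  # ante reached on an average run
print(ante_for_round(mean_round - spread),
      ante_for_round(mean_round + spread))  # rough one-sigma ante range
```

Under those assumptions the average run ends in ante 7, one ante short of the win condition.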

by ankit2191 hours ago|

[-]

Agreed. Gemini 3 Pro has always felt to me like it has a pretraining alpha, if you will, and many data points continue to support that. Even Flash, which was post-trained with different techniques than Pro, is good or equivalent at tasks that require post-training, occasionally even beating Pro (e.g., in APEX bench from Mercor, which is basically a tool-calling test, simplifying, Flash beats Pro). The score on ARC-AGI-2 is another data point in the same direction. Deep Think is sort of parallel test-time compute with some level of distilling and refinement from certain trajectories (guessing based on my usage and understanding), same as gpt-5.2-pro, and can extract more because of the pretraining datasets.

(i am sort of basing this on papers like limits of rlvr, and pass@k and pass@1 differences in rl posttraining of models, and this score just shows how "skilled" the base model was or how strong the priors were. i apologize if this is not super clear, happy to expand on what i am thinking)

by ebiester3 hours ago|

[-]

It's trained on YouTube data. It's going to get roffle and drspectred at the very least.

by silver_sun3 hours ago|

[-]

Google has a library of millions of scanned books from their Google Books project that started in 2004. I think we have reason to believe that there are more than a few books about effectively playing different traditional card games in there, and that an LLM trained with that dataset could generalize to understand how to play Balatro from a text description.

Nonetheless I still think it's impressive that we have LLMs that can just do this now.

by mjamesaustin2 hours ago|

[-]

Winning in Balatro has very little to do with understanding how to play traditional poker. Yes, you do need a basic knowledge of different types of poker hands, but the strategy for succeeding in the game is almost entirely unrelated to poker strategy.

by gilrain3 hours ago|

[-]

If it tried to play Balatro using knowledge of, e.g., poker, it would lose badly rather than win. Have you played?

by gcr2 hours ago|

[-]

I think I weakly disagree. Poker players have intuitive sense of the statistics of various hand types showing up, for instance, and that can be a useful clue as to which build types are promising.

by barnas22 hours ago|

[-]

>Poker players have intuitive sense of the statistics of various hand types showing up, for instance, and that can be a useful clue as to which build types are promising.

Maybe in the early rounds, but deck fixing (e.g. Hanged Man, Immolate, Trading Card, DNA, etc) quickly changes that. Especially when pushing for "secret" hands like the 5 of a kind, flush 5, or flush house.

by winstonp4 hours ago|

[-]

DeepSeek hasn't been SotA in at least 12 calendar months, which might as well be a decade in LLM years

by cachius4 hours ago|

[-]

What about Kimi and GLM?

by zozbot2343 hours ago|

[-]

These are well behind the general state of the art (1yr or so), though they're arguably the best openly-available models.

by epolanski31 minutes ago|

[-]

Idk man, GLM 5 in my tests matches opus 4.5 which is what, two months old?

by tgrowazay2 hours ago|

[-]

According to the Artificial Analysis ranking, GLM-5 is at #4 after Claude Opus 4.5, GPT-5.2-xhigh and Claude Opus 4.6.

by dudisubekti4 hours ago|

[-]

But... there's Deepseek v3.2 in your link (rank 7)

by tehsauce2 hours ago|

[-]

How does it do on gold stake?

by littlestymaar4 hours ago|

[-]

> I don't think there are many people who posted their Balatro playthroughs in text form online

There is *tons* of Balatro content on YouTube though, and there is absolutely zero doubt that Google is using YouTube content to train their model.

by sdwr4 hours ago|

[-]

Yeah, or just the steam text guides would be a huge advantage.

I really doubt it's playing completely blind

by acid__3 hours ago|

[-]

> Most (probably >99.9%) players can't do that at the first attempt

Eh, both myself and my partner did this. To be fair, we weren’t going in completely blind, and my partner hit a Legendary joker, but I think you might be slightly overstating the difficulty. I’m still impressed that Gemini did it.

by Falsintio4 hours ago|

[-]

[dead]

by nubg6 hours ago|

[-]

Weren't we barely scraping 1-10% on this with state-of-the-art models a year ago, and wasn't it considered the final boss, i.e. solve this and it's almost AGI-like?

I ask because I cannot distinguish all the benchmarks by heart.

by modeless5 hours ago|

[-]

François Chollet, creator of ARC-AGI, has consistently said that solving the benchmark does not mean we have AGI. It has always been meant as a stepping stone to encourage progress in the correct direction rather than as an indicator of reaching the destination. That's why he is working on ARC-AGI-3 (to be released in a few weeks) and ARC-AGI-4.

His definition of reaching AGI, as I understand it, is when it becomes impossible to construct the next version of ARC-AGI because we can no longer find tasks that are feasible for normal humans but unsolved by AI.

by beklein4 hours ago|

https://x.com/fchollet/status/2022036543582638517

[-]

by joelthelion3 hours ago|

[-]

Do Opus 4.6 or Gemini Deep Think really use test-time adaptation? How does it work in practice?

by mapontosevenths3 hours ago|

[-]

> His definition of reaching AGI, as I understand it, is when it becomes impossible to construct the next version of ARC-AGI because we can no longer find tasks that are feasible for normal humans but unsolved by AI.

That is the best definition I've yet to read. If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.

That said, I'm reminded of the impossible voting tests they used to give black people to prevent them from voting. We don't ask nearly so much proof from a human; we take their word for it. On the few occasions we did ask for proof, it inevitably led to horrific abuse.

Edit: The average human tested scores 60%. So the machines are already smarter on an individual basis than the average human.

by estearum3 hours ago|

[-]

> If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.

This is not a good test.

A dog won't claim to be conscious but clearly is, despite you not being able to prove one way or the other.

GPT-3 will claim to be conscious and (probably) isn't, despite you not being able to prove one way or the other.

by dullcrisp2 hours ago|

[-]

An LLM will claim whatever you tell it to claim. (In fact this Hacker News comment is also conscious.) A dog won’t even claim to be a good boy.

by WarmWash3 hours ago|

[-]

>because we can no longer find tasks that are feasible for normal humans but unsolved by AI.

"Answer "I don't know" if you don't know an answer to one of the questions"

by mrandish1 hours ago|

[-]

I've been surprised how difficult it is for LLMs to simply answer "I don't know."

It also seems oddly difficult for them to 'right-size' the length and depth of their answers based on prior context. I either have to give it a fixed length limit or put up with exhaustive answers.

by CamperBob239 minutes ago|

[-]

The best pro/research-grade models from Google and OpenAI now have little difficulty recognizing when they don't know how or can't find enough information to solve a given problem. The free chatbot models rarely will, though.

by sva_3 hours ago|

[-]

> Edit: The average human tested scores 60%. So the machines are already smarter on an individual basis than the average human.

I think being better at this particular benchmark does not imply they're 'smarter'.

by woah3 hours ago|

[-]

> If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.

Can you "prove" that GPT-2 isn't conscious?

by mapontosevenths3 hours ago|

[-]

If we equate self-awareness with consciousness then yes. Several papers have now shown that SOTA models have self-awareness of at least a limited sort. [0][1]

As far as I'm aware no one has ever proven that for GPT-2, but the methodology for testing it is available if you're interested.

[0]https://arxiv.org/pdf/2501.11120

[1]https://transformer-circuits.pub/2025/introspection/index.ht...

by pixl9747 minutes ago|

[-]

Honestly our ideas of consciousness and sentience really don't fit well with machine intelligence and capabilities.

There is the idea of self as in 'I am this execution', or maybe 'I am this compressed memory stream that is now the concept of me'. But what does consciousness mean if you can be endlessly copied? If embodiment doesn't mean much because the end of your body doesn't mean the end of you?

A lot of people are chasing AI and how much it's like us, but it could be very easy to miss the ways it's not like us but still very intelligent or adaptable.

by criddell2 hours ago|

[-]

> The average human tested scores 60%. So the machines are already smarter on an individual basis than the average human.

Maybe it's testing the wrong things then. Even those of us who are merely average can do lots of things that machines don't seem to be very good at.

I think ability to learn should be a core part of any AGI. Take a toddler who has never seen anybody doing laundry before and you can teach them in a few minutes how to fold a t-shirt. Where are the dumb machines that can be taught?

by mapontosevenths1 hours ago|

[-]

Would you argue that people with long term memory issues are no longer conscious then?

by CamperBob237 minutes ago|

[-]

There's no shortage of laundry-folding robot demos these days. Some claim to benefit from only minimal monkey-see/monkey-do levels of training, but I don't know how credible those claims are.

by jrflowers29 minutes ago|

[-]

> If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.

https://x.com/aedison/status/1639233873841201153#m

by hmmmmmmmmmmmmmm4 hours ago|

[-]

I don't think the creator believes ARC3 can't be solved but rather that it can't be solved "efficiently" and >$13 per task for ARC2 is certainly not efficient.

But at this rate, the people who talk about the goal posts shifting even once we achieve AGI may end up correct, though I don't think this benchmark is particularly great either.

by fishpham6 hours ago|

[-]

Yes, but benchmarks like this are often flawed because leading model labs frequently participate in 'benchmarkmaxxing' - ie improvements on ARC-AGI2 don't necessarily indicate similar improvements in other areas (though it does seem like this is a step function increase in intelligence for the Gemini line of models)

by layer85 hours ago|

[-]

Isn’t the point of ARC that you can’t train against it? Or doesn’t it achieve that goal anymore somehow?

by egeozcan5 hours ago|

[-]

How can you make sure of that? AFAIK, these SOTA models run exclusively on their developers' hardware. So any test, any benchmark, anything you do, does leak per definition. Considering the nature of us humans and the typical prisoner's dilemma, I don't see how they wouldn't focus on improving benchmarks even when it gets a bit... shady?

I say this as a person who really enjoys AI, by the way.

by mrandish1 hours ago|

[-]

> does leak per definition.

As a measure focused solely on fluid intelligence, learning novel tasks and test-time adaptability, ARC-AGI was specifically designed to be resistant to pre-training - for example, unlike many mathematical and programming test questions, ARC-AGI problems don't have first order patterns which can be learned to solve a different ARC-AGI problem.

The ARC non-profit foundation has private versions of their tests which are never released and only the ARC can administer. There are also public versions and semi-public sets for labs to do their own pre-tests. But a lab self-testing on ARC-AGI can be susceptible to leaks or benchmaxing, which is why only "ARC-AGI Certified" results using a secret problem set really matter. The 84.6% is certified and that's a pretty big deal.

IMHO, ARC-AGI is a unique test that's different than any other AI benchmark in a significant way. It's worth spending a few minutes learning about why: https://arcprize.org/arc-agi.

by D-Machine1 hours ago|

[-]

> which is why only "ARC-AGI Certified" results using a secret problem set really matter. The 84.6% is certified and that's a pretty big deal.

So, I'd agree if this was on the true fully private set, but Google themselves says they test on only the semi-private:

> ARC-AGI-2 results are sourced from the ARC Prize website and are ARC Prize Verified. The set reported is v2, semi-private (https://storage.googleapis.com/deepmind-media/gemini/gemini_...)

This also seems to contradict what ARC-AGI claims about what "Verified" means on their site.

> How Verified Scores Work: Official Verification: Only scores evaluated on our hidden test set through our official verification process will be recognized as verified performance scores on ARC-AGI (https://arcprize.org/blog/arc-prize-verified-program)

So, which is it? IMO you can trivially train / benchmax on the semi-private data, because it is still basically just public, you just have to jump through some hoops to get access. This is clearly an advance, but it seems to me reasonable to conclude this could be driven by some amount of benchmaxing.

EDIT: Hmm, okay, it seems their policy and wording is a bit contradictory. They do say (https://arcprize.org/policy):

"To uphold this trust, we follow strict confidentiality agreements. [...] We will work closely with model providers to ensure that no data from the Semi-Private Evaluation set is retained. This includes collaborating on best practices to prevent unintended data persistence. Our goal is to minimize any risk of data leakage while maintaining the integrity of our evaluation process."

But it surely is still trivial to just make a local copy of each question served from the API, without this being detected. It would violate the contract, but there are strong incentives to do this, so I guess it just comes down to how much one trusts the model providers here. I wouldn't trust them, given e.g. https://www.theverge.com/meta/645012/meta-llama-4-maverick-b.... It is just too easy to cheat without being caught here.

by WarmWash3 hours ago|

[-]

Because the gains from spending time improving the model overall outweigh the gains from spending time individually training on benchmarks.

The pelican benchmark is a good example, because it's been representative of models' ability to generate SVGs, not just pelicans on bikes.

by D-Machine18 minutes ago|

[-]

> Because the gains from spending time improving the model overall outweigh the gains from spending time individually training on benchmarks.

This may not be the case if you just e.g. roll the benchmarks into the general training data, or make running on the benchmarks just another part of the testing pipeline. I.e. improving the model generally and benchmaxing could very conceivably just both be done at the same time, it needn't be one or the other.

I think the right takeaway is to ignore the specific percentages reported on these tests (they are almost certainly inflated / biased) and always assume cheating is going on. What matters is that (1) the most serious tests aren't saturated, and (2) scores are improving. I.e. even if there is cheating, we can presume this was always the case, and since models couldn't do as well before even when cheating, these are still real improvements.

And obviously what actually matters is performance on real-world tasks.


by theywillnvrknw5 hours ago|

[-]

* that you weren't supposed to be able to


by jstummbillig5 hours ago|

[-]

Could it also be that the models are just a lot better than a year ago?

by bigbadfeline3 hours ago|

[-]

> Could it also be that the models are just a lot better than a year ago?

No, the proof is in the pudding.

After AI we're having higher prices, higher deficits and lower standard of living. Electricity, computers and everything else costs more. "Doing better" can only be justified by that real benchmark.

If Gemini 3 DT was better we would have falling prices of electricity and everything else at least until they get to pre-2019 levels.

by ctoth3 hours ago|

[-]

> If Gemini 3 DT was better we would have falling prices of electricity and everything else at least

Man, I've seen some maintenance folks down on the field before working on them goalposts but I'm pretty sure this is the first time I saw aliens from another Universe literally teleport in, grab the goalposts, and teleport out.

by WarmWash3 hours ago|

[-]

You might call me crazy, but at least in 2024, consumers spent ~1% less of their income on expenses than in 2019[2], which suggests that 2024 is more affordable than 2019.

This is from the BLS consumer survey report released in December[1].

Prices are never going back to 2019 numbers though.

[1]https://www.bls.gov/news.release/cesan.nr0.htm

[2]https://www.bls.gov/opub/reports/consumer-expenditures/2019/

[-]

That's an improper analysis.

First off, it's dollar-averaging every category, so it's not "% of income", which varies based on unit income.

Second, I could commit to spending my entire life with constant spending (optionally inflation adjusted, optionally as a % of income), by adjusting the quality of goods and services I purchase. So the total spending % is not a measure of affordability.

by WarmWash2 hours ago|

[-]

Almost everyone lifestyle ratchets, so the handful that actually downgrade their living rather than increase spending would be tiny.

This is part of a wider trend too, where economic stats don't align with what people are saying. Which is most likely explained by the economic anomaly of the pandemic skewing people's perceptions.

by twoodfin33 minutes ago|

[-]

We have centuries of historical evidence that people really, really don’t like high inflation, and it takes a while & a lot of turmoil for those shocks to work their way through society.

by XenophileJKO5 hours ago|

https://chatgpt.com/s/m_698e2077cfcc81919ffbbc3d7cccd7b3

[-]

by aleph_minus_one5 hours ago|

[-]

I don't understand what you want to tell us with this image.

by fragmede4 hours ago|

[-]

they're accusing GGP of moving the goalposts.

by olalonde5 hours ago|

[-]

Would be cool to have a benchmark with actually unsolved math and science questions, although I suspect models are still quite a long way from that level.

[-]

Does folding a protein count? How about increasing performance at Go?


by verdverm6 hours ago|

https://bsky.app/profile/pekka.bsky.social/post/3meokmizvt22...

[-]

Here's a good thread over 1+ month, as each model comes out

tl;dr - Pekka says Arc-AGI-2 is now toast as a benchmark

by Aperocky5 hours ago|

[-]

If you look at the problem space it is easy to see why it's toast, maybe there's intelligence in there, but hardly general.

by verdverm5 hours ago|

[-]

the best way I've seen this described is "spikey" intelligence: really good at some points, and those make the spikes

humans are the same way, we all have a unique spike pattern, interests and talents

ai are effectively the same spikes across instances, if simplified. I could argue self driving vs chatbots vs world models vs game playing might constitute enough variation. I would not say the same of Gemini vs Claude vs ... (instances), that's where I see "spikey clones"

by Aperocky5 hours ago|

[-]

You can get more spiky with AIs, whereas with the human brain we are more hard wired.

So maybe we are forced to be more balanced and general whereas AIs don't have to be.

by verdverm5 hours ago|

[-]

I suspect the non-spikey part is the more interesting comparison

Why is it so easy for me to open the car door, get in, close the door, buckle up. You can do this in the dark and without looking.

There are an infinite number of little things like this that you think zero about and that take near zero energy, yet which are extremely hard for AI

by pixl9737 minutes ago|

[-]

>Why is it so easy for me to open the car door

Because this part of your brain has been optimized for hundreds of millions of years. It's been around a long ass time and takes an amazingly low amount of energy to do these things.

On the other hand the 'thinking' part of your brain, that is, your higher intelligence, is very new to evolution. It's expensive to run. It's problematic when giving birth. It's really slow with things like numbers; heck, a tiny calculator can whip your butt at adding.

There's a term for this, but I can't think of it at the moment.

[-]

You are asking a robotics question, not an AI question. Robotics is more and less than AI. Boston Dynamics robots are getting quite near your benchmark.

by tasuki2 hours ago|

[-]

> maybe there's intelligence in there, but hardly general.

Of course. Just as our human intelligence isn't general.

by mNovak5 hours ago|

[-]

I'm excited for the big jump in ARC-AGI scores from recent models, but no one should think for a second this is some leap in "general intelligence".

I joke to myself that the G in ARC-AGI is "graphical". I think what's held back models on ARC-AGI is their terrible spatial reasoning, and I'm guessing that's what the recent models have cracked.

Looking forward to ARC-AGI 3, which focuses on trial and error and exploring a set of constraints via games.

by causal4 hours ago|

[-]

Agreed. I love the elegance of ARC, but it always felt like a gotcha to give spatial reasoning challenges to token generators, and the fact that the token generators are somehow beating it anyway really says something.


by throw3108224 hours ago|

https://arcprize.org/arc-agi/2/

[-]

The average ARC AGI 2 score for a single human is around 60%.

"100% of tasks have been solved by at least 2 humans (many by more) in under 2 attempts. The average test-taker score was 60%."

by modeless4 hours ago|

[-]

Worth keeping in mind that in this case the test takers were random members of the general public. The score of e.g. people with bachelor's degrees in science and engineering would be significantly higher.

by throw3108223 hours ago|

[-]

Random members of the public = average human beings. I thought those were already classified as General Intelligences.

by imiric2 hours ago|

[-]

What is the point of comparing performance of these tools to humans? Machines have been able to accomplish specific tasks better than humans since the industrial revolution. Yet we don't ascribe intelligence to a calculator.

None of these benchmarks prove these tools are intelligent, let alone generally intelligent. The hubris and grift are exhausting.

by throw3108222 hours ago|

[-]

> Machines have been able to accomplish specific tasks...

Indeed, and the specific task machines are accomplishing now is intelligence. Not yet "better than human" (and certainly not better than every human) but getting closer.

by imiric2 hours ago|

[-]

> Indeed, and the specific task machines are accomplishing now is intelligence.

How so? This sentence, like most of this field, is making baseless claims that are more aspirational than true.

Maybe it would help if we could first agree on a definition of "intelligence", yet we don't have a reliable way of measuring that in living beings either.

If the people building and hyping this technology had any sense of modesty, they would present it as what it actually is: a large pattern matching and generation machine. This doesn't mean that this can't be very useful, perhaps generally so, but it's a huge stretch and an insult to living beings to call this intelligence.

But there's a great deal of money to be made on this idea we've been chasing for decades now, so here we are.

by warkdarrior1 hours ago|

[-]

> Maybe it would help if we could first agree on a definition of "intelligence", yet we don't have a reliable way of measuring that in living beings either.

How about this specific definition of intelligence?

   Solve any task provided as text or images.

AGI would be to achieve that faster than an average human.

by throw3108221 hours ago|

[-]

I still can't understand why they should be faster. Humans have general intelligence, afaik. It doesn't matter if it's fast or slow. A machine able to do what the average human can do (intelligence-wise) but 100 times slower still has general intelligence. Since it's artificial, it's AGI.

by guelo2 hours ago|

[-]

What's the point of denying or downplaying that we are seeing amazing and accelerating advancements in areas that many of us thought were impossible?

by D-Machine6 minutes ago|

[-]

It can be reasonable to be skeptical that advances on benchmarks may be only weakly or even negatively correlated with advances on real-world tasks. I.e. a huge jump on benchmarks might not be perceptible to 99% of users doing 99% of tasks, or some users might even note degradation on specific tasks. This is especially the case when there is some reason to believe most benchmarks are being gamed.

Real-world use is what matters, in the end. I'd be surprised if a change this large doesn't translate to something noticeable in general, but the skepticism is not unreasonable here.

by munksbeer36 minutes ago|

[-]

I would suggest it is a phenomenon that is well studied, and has many forms. I guess mostly identity preservation. If you dislike AI from the start, it is generally a very strongly emotional view. I don't mean there is no good reason behind it, I mean, it is deeply rooted in your psyche, very emotional.

People are incredibly unlikely to change those sort of views, regardless of evidence. So you find this interesting outcome where they both viscerally hate AI, but also deny that it is in any way as good as people claim.

That won't change with evidence until it is literally impossible not to change.

by CamperBob234 minutes ago|

[-]

> The hubris and grift are exhausting.

And moving the goalposts every few months isn't? What evidence of intelligence would satisfy you?

Personally, my biggest unsatisfied requirement is continual-learning capability, but it's clear we aren't too far from seeing that happen.

by colordrops4 hours ago|

[-]

Wouldn't you deal with spatial reasoning by giving it access to a tool that structures the space in a way it can understand or just is a sub-model that can do spatial reasoning? These "general" models would serve as the frontal cortex while other models do specialized work. What is missing?

by causal4 hours ago|

[-]

That's a bit like saying just give blind people cameras so they can see.

by pixl9734 minutes ago|

[-]

I mean, no not really. These models can see, you're giving them eyes to connect to that part of their brain.

by amelius1 hours ago|

[-]

They should train more on sports commentary, perhaps that could give spatial reasoning a boost.

by aeyes5 hours ago|

https://arcprize.org/leaderboard

[-]

$13.62 per task - so we need another 5-10 years for the price to run this to become reasonable?

But the real question is if they just fit the model to the benchmark.

by onlyrealcuzzo4 hours ago|

[-]

Why 5-10 years?

At current rates, price per equivalent output is dropping by 99.9% over 5 years.

That's basically $0.01 in 5 years.

Does it really need to be that cheap to be worth it?

Keep in mind, $0.01 in 5 years is worth less than $0.01 today.
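A quick sketch of that arithmetic, assuming the 99.9%-over-5-years figure holds and that the decline is a constant exponential (both are assumptions, not measurements; the function names are made up for illustration):

```python
def projected_cost(cost_today: float, total_drop: float,
                   horizon_years: float, years: float) -> float:
    """Cost after `years`, given a fractional `total_drop` over `horizon_years`.

    Assumes a constant exponential decline across the whole horizon.
    """
    remaining = 1.0 - total_drop
    return cost_today * remaining ** (years / horizon_years)


def annual_decline(total_drop: float, horizon_years: float) -> float:
    """Equivalent per-year fractional price drop under the same assumption."""
    return 1.0 - (1.0 - total_drop) ** (1.0 / horizon_years)


cost_per_task = 13.62  # USD per task, the figure quoted upthread
print(round(projected_cost(cost_per_task, 0.999, 5, 5), 4))  # cost after 5 years
print(round(annual_decline(0.999, 5), 3))                    # implied yearly drop
```

Under those assumptions the per-task cost lands at a bit over one cent after five years, which corresponds to prices falling by roughly three quarters every year.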

by willis9364 hours ago|

[-]

Wow that's incredible! Could you show your work?

by onlyrealcuzzo3 hours ago|

https://epoch.ai/data-insights/llm-inference-price-trends

[-]

by golem142 hours ago|

[-]

A grad student hour is probably more expensive…

by elromulous1 hours ago|

[-]

In my experience, a grad student hour is treated as free :(

by re-thc4 hours ago|

[-]

What’s reasonable? It’s less than minimum hourly wage in some countries.

by willis9364 hours ago|

[-]

Burned in seconds.

[-]

Getting the work done faster for the same money doesn't make the work more expensive.

You could slow down the inference to make the task take longer, if $/sec matters.

by igravious5 hours ago|

[-]

That's not a long time in the grand scheme of things.

by throwup2385 hours ago|

[-]

Speak for yourself. Five years is a long time to wait for my plans of world domination.

by tasuki2 hours ago|

[-]

This concerns me actually. With enough people (n>=2) wanting to achieve world domination, we have a problem.

by throwup2381 hours ago|

[-]

It’s not that I want to achieve world domination (imagine how much work that would be!), it’s just that it’s the inevitable path for AI and I’d rather it be me than the next schmuck with a Claude Max subscription.

by pixl9732 minutes ago|

[-]

I mean everyone with prompt access to the model says these things, but people like Sam and Elon say these things and mean it.

[-]

n = 2 is Pinky and the Brain.

by amelius4 hours ago|

[-]

Yes, you better hurry.

by mnicky6 hours ago|

[-]

Well, fair comparison would be with GPT-5.x Pro, which is the same class of a model as Gemini Deep Think.

by culi3 hours ago|

https://arcprize.org/leaderboard

[-]

Yes but with a significant (logarithmic) increase in cost per task. The ARC-AGI site is less misleading and shows how GPT and Claude are not actually far behind

by saberience5 hours ago|

[-]

Arc-AGI (and Arc-AGI-2) is the most overhyped benchmark around though.

It's completely misnamed. It should be called useless visual puzzle benchmark 2.

It's a visual puzzle, making it way easier for humans than for models trained on text firstly. Secondly, it's not really that obvious or easy for humans to solve themselves!

So the idea that if an AI can solve "Arc-AGI" or "Arc-AGI-2" it's super smart or even "AGI" is frankly ridiculous. It's a puzzle that means nothing basically, other than the models can now solve "Arc-AGI"

by CuriouslyC5 hours ago|

[-]

The puzzles are calibrated for human solve rates, but otherwise I agree.

by saberience5 hours ago|

[-]

My two elderly parents cannot solve Arc-AGI puzzles, but can manage to navigate the physical world, their house, garden, make meals, clean the house, use the TV, etc.

I would say they do have "general intelligence", so whatever Arc-AGI is "solving" it's definitely not "AGI"

by hmmmmmmmmmmmmmm5 hours ago|

[-]

You are confusing fluid intelligence with crystallised intelligence.

by casey24 hours ago|

[-]

I think you are making that confusion. Any robotic system in the place of his parents would fail within a few hours.

There are more novel tasks in a day than ARC provides.

by hmmmmmmmmmmmmmm4 hours ago|

[-]

Children have great levels of fluid intelligence, that's how they are able to learn to quickly navigate in a world that they are still very new to. Seniors with decreasing capacity increasingly rely on crystallised intelligence, that's why they can still perform tasks like driving a car but can fail at completely novel tasks, sometimes even using a smartphone if they have not used one before.

by mrbungie1 hours ago|

[-]

My late grandma learnt how to use an iPad by herself during her 70s to 80s without any issues, mostly motivated by her wish to read her magazines, doomscroll facebook and play solitaire. Her last job was being a bakery cashier in her 30s and she didn't learn how to use a computer in-between, so there was no skill transfer going on.

Humans and their intelligence are actually incredible and probably will continue to be so; I don't really care what tech/"think" leaders want us to think.

by zeroonetwothree3 hours ago|

[-]

It really depends on motivation. My 90 year old grandmother can use a smartphone just fine since she needs it to see pictures of her (great) grandkids.


by karmasimida7 hours ago|

[-]

It is over

by baal80spam7 hours ago|