by aembleton 3 hours ago

by agrounds 3 hours ago

> if an AI is confidently telling you something wrong it's hard to work with.

But they all do that. It just comes with the territory. Grok will absolutely do the same thing another time you try it.

by aembleton 1 hour ago

> Grok will absolutely do the same thing another time you try it.

True; it just hasn't happened yet. It will at some point, though. With the Sunnypilot example, it told me outright that it is not possible on that fork, which I appreciated. The others all seem to hallucinate some setting.

by ToucanLoucan 3 hours ago

It is really, really genuinely concerning how many people think there are profound measurable differences between these things.

Like yeah tonally I guess there are. But with regard to references and information? You’re literally just using three different slot machines and claiming one is hot.

I suppose, though, I shouldn’t be that surprised, since Vegas and every other casino on Earth have been built on duping people in that exact way.

by aembleton 1 hour ago

> You’re literally just using three different slot machines and claiming one is hot.

It's a fair point. I haven't tested many queries across them all and checked their answers, but if I want to ask one of them a question, right now it's Grok, just because I trust its answers more.

by ToucanLoucan 1 hour ago

It's not a methodology problem, it's a testability problem. LLMs are not deterministic. You can ask the same question to the same LLM five times and you'll likely get at least three different answers.

Again. Slot machine.

by Ukv 1 hour ago

You can meaningfully test whether one slot machine hits the jackpot more often than another; the methodology just needs to involve a large number of repeats rather than a few anecdotes. There are some LLM leaderboard sites that do it with blind comparisons.
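
To make the "large number of repeats" point concrete, here is a minimal sketch. It is not what any leaderboard site is confirmed to run, just a toy simulation. The two hit rates (`RATE_A`, `RATE_B`) and the helper names `simulate_hits` and `two_proportion_z` are invented for illustration, and the comparison uses a standard pooled two-proportion z statistic.

```python
import math
import random


def simulate_hits(true_rate, trials, rng):
    """Count how many of `trials` simulated answers come out correct."""
    return sum(rng.random() < true_rate for _ in range(trials))


def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """Pooled two-proportion z statistic for the gap between two hit rates."""
    pooled = (hits_a + hits_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:  # every trial hit, or every trial missed: no evidence of a gap
        return 0.0
    return (hits_a / n_a - hits_b / n_b) / std_err


# Hypothetical "true" accuracies, invented purely for illustration.
RATE_A, RATE_B = 0.70, 0.60

rng = random.Random(0)  # fixed seed so the run is repeatable
for trials in (5, 50, 5000):
    hits_a = simulate_hits(RATE_A, trials, rng)
    hits_b = simulate_hits(RATE_B, trials, rng)
    z = two_proportion_z(hits_a, trials, hits_b, trials)
    print(f"n={trials:>4}  A={hits_a / trials:.2f}  B={hits_b / trials:.2f}  z={z:+.2f}")
```

A |z| above roughly 1.96 corresponds to the usual 5% two-sided significance level. With only a handful of trials per machine the statistic will usually stay below that bar even when the underlying rates really differ, which is the "few anecdotes" situation; with thousands of trials a ten-point gap becomes hard to miss.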

by cyanydeez 2 hours ago

humans make poor scientists. most people have already made a decision before they run any tests.

the smartest among them just make the tests complicated and biased; the less intelligent just cherry pick.

of course, would you really expect anyone to do real research in this economy?

by alex1138 23 minutes ago

Hey, have you used Claude much? What are your experiences with it?

by aembleton 10 minutes ago

No, I've not tried Claude.