Just today, the LLM-based auto-review that my company enabled for all PRs edited my PR description to confidently assert that I had added a new RPC. I had not. I deleted code and nothing else. Nothing was added. The RPC it claimed I added did not exist.

This is a common occurrence.

by al_borland 18 hours ago

LLMs are nondeterministic, so it’s impossible to make something 100% reproducible. Even when it has an issue, it might fail in a different way each time. If it’s well publicized, they’ll patch that very specific example, but the foundational issue is still there (like counting the R’s in strawberry).

I still regularly run into the issue where it just makes up API endpoints or CLI commands, or adds flags that simply don’t exist.

I also regularly ask it things and it gives me a bad answer, so I push back, and it says something to the effect of “you’re right, I didn’t consider that, let me look at that more”… then tells me the exact opposite of the previous response.

Or it says “thing X has never happened”, and I ask what about <insert example>, and it goes to look it up and says, “oh, thing X actually did happen.”

I run into this daily. Multiple times per day. How can I trust a system like this? Are people just blindly accepting what the LLM says as truth? Is that why people think it’s good?

by jagged-chisel 19 hours ago

> Reproducible would be great

Wouldn’t it be great? I’m still waiting for reproducibility from LLMs.

by bko 18 hours ago

Can you reproduce irreproducibility?

Give me a question that the LLM answers vastly differently across runs.

I keep hearing how it's dumb and wrong, but no one ever shares the chat or prompt.

by jagged-chisel 16 hours ago

Yes. https://news.ycombinator.com/item?id=48420769

by uxhacker 18 hours ago

Try this with ChatGPT, Grok, or Claude:

How many days of the week contain the letter d?

The answer I get with ChatGPT and Grok is 3, and with Claude it is 6.
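For reference, the ground truth here is easy to check mechanically. A minimal sketch, assuming English day names and a case-insensitive match (the `DAYS` list and `days_containing` helper are made up for illustration):

```python
# English day names; assumption: the question is about these seven words.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def days_containing(letter: str) -> list[str]:
    """Return the day names that contain `letter`, ignoring case."""
    return [day for day in DAYS if letter.lower() in day.lower()]


matches = days_containing("d")
print(len(matches), matches)  # prints 7 followed by all seven day names
```

Every name ends in "day", so under these assumptions the count is 7, and neither 3 nor 6 matches it.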

by jagged-chisel 17 hours ago

I just tried this with ChatGPT only, twice: the web interface in a Firefox private window and in a Chrome incognito window. I asked both the identical question, "How many names of the days of the week contain the letter D?"

In Firefox I got 6. In Chrome I got 7. LLMs are not even self-consistent.

I have the screenshots if anyone cares.
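If anyone wants to go beyond screenshots, the same check can be scripted. This is only a sketch: `ask` stands in for whatever function sends a prompt to a model and returns its text answer, and `fake_ask` is a canned stub I made up so the example runs without any real API.

```python
import itertools
from collections import Counter
from typing import Callable


def tally_answers(ask: Callable[[str], str], prompt: str, runs: int = 10) -> Counter:
    """Call `ask(prompt)` `runs` times and count each distinct normalized answer."""
    return Counter(ask(prompt).strip().lower() for _ in range(runs))


def is_self_consistent(tally: Counter) -> bool:
    """True when every run produced the same normalized answer."""
    return len(tally) <= 1


# Hypothetical stub: alternates between two canned answers instead of calling a model.
_canned = itertools.cycle(["6", "7"])


def fake_ask(prompt: str) -> str:
    return next(_canned)


tally = tally_answers(
    fake_ask,
    "How many names of the days of the week contain the letter D?",
    runs=4,
)
print(tally, is_self_consistent(tally))  # Counter({'6': 2, '7': 2}) False
```

Swapping `fake_ask` for a real client call would give a shareable tally per prompt, which is more useful than "it was wrong once".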

by toraway 15 hours ago

Bad example but since it literally just happened a few hours ago:

Teams Copilot meeting assistant auto-renamed the meeting title/summary, which is now prominently placed at the top, to “Month end close wrap up discussion” because someone posted in chat “sorry can’t make the meeting, we’re wrapping up month end close”.

Really confused the next guy who joined the meeting and derailed things for a minute or two before we could get back on topic.