This is a common occurrence.
I still regularly run into the issue where it just makes up API endpoints or CLI commands, or adds flags that simply don’t exist.
I also regularly ask it things and it gives me a bad answer, so I push back, and it says something to the effect of “you’re right, I didn’t consider that, let me look at that more”… then tells me the exact opposite of the previous response.
Or it says “thing X has never happened”, and I ask what about <insert example>, and it goes to look it up and says, “oh, thing X actually did happen.”
I run into this daily. Multiple times per day. How can I trust a system like this? Are people just blindly accepting what the LLM says as truth? Is that why people think it’s good?
Wouldn’t it be great? I’m still waiting for reproducibility from LLMs.
Give me a question that the LLM answers vastly differently across runs.
I keep hearing how it's dumb and wrong, but no one ever shares the chat or prompt.
How many days of the week contain the letter d?
The answer I get with ChatGPT and Grok is 3, and with Claude it's 6.
In Firefox I got 6. In Chrome I got 7. LLMs are not even self-consistent.
I have the screenshots if anyone cares.
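For what it's worth, this one is easy to check mechanically. A minimal Python sketch, assuming the standard English day names and a case-insensitive match (the helper name `days_containing` is just made up for illustration):

```python
# English day names; assumed to be the list the question refers to.
DAYS = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]

def days_containing(letter, days=DAYS):
    """Return the day names that contain `letter`, ignoring case."""
    needle = letter.lower()
    return [day for day in days if needle in day.lower()]

matches = days_containing("d")
print(len(matches), matches)
```

Every English day name ends in "day", so all seven match, and none of the answers 3 or 6 quoted above is right.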
The Teams Copilot meeting assistant auto-renamed the meeting title/summary, which is now prominently placed at the top, to “Month end close wrap up discussion” because someone posted in chat, “sorry can’t make the meeting, we’re wrapping up month end close”.
Really confused the next guy who joined the meeting and derailed things for a minute or two before we could get back on topic.