I tried Fable vs Codex 5.5 xhigh on three different cases.
1. A resource leak with unknown cause. Both of them zoomed onto the same potential issue and proposed almost identical patches. Fable missed an edge case that Codex handled correctly.
2. Review of a SPICE model. Models had different comments, none substantial. Both missed important issues that were simulated inadequately. Clearly a valley where they are undertrained.
3. An open research problem in CS, presented as a codebase with documentation and performance metrics over datasets. Both were spinning wheels. Which can certainly mean the whole approach had run its course but older models were not able to identify the previous round of improvement either.
I liked the prose coming out of Fable more: it was almost like if Obama was giving tech speeches. By actual solution metrics however they both appear in the same place, naturally with the caveat that we didn't really have more time with Fable to compare further.
When models miss things, there is always the possibility that it has the capability to identify the issues but it is misevaluating the level of analysis that you want it to do. The fine tuning will have them targeting a balance of subjective opinions of what is appropriate. To go beyond broad demographic guessing the model really needs to 'get to know you' to know what it means when you specifically request an action. Without that information about you it has to weigh your words against the level of sophistication it expects a standard user is able to express.
I guess OP should have told it more explicitly to “find all errors without missing anything.”
> Thinking.. But I found a smoking gun of an error with this SPICE model, maybe I should inform the user.
> Thinking... Hm, but again, I know this human well, they likely don't care about this error. That's absolutely right - it's not an assistant's job to decide this, it's the user's.
I always find it amusing when people claim "a very complex implementation". Sometimes it's a hard problem, other times an easy one. Either way that's not for you to judge.
And the implementation being complex... is that a good thing? Wouldn't a simple implementation be better? It reminded me of the parable of two programmers.
Says who? If you find something complex, you can just say that it's complex. I don't get what the objection is.
And finally, LLMs also lack the emotional or human context for why I am doing the specific thing I am doing. Otherwise it will revert to the mode/mean in everything it does. This is obvious, btw: LLMs are generative but they are trained on and largely produce median results if given median inputs. To get results that are "outside the mean/median/average/mode", you need to provide it sufficient context, tokens and input to guide it towards a path that generates higher quality output.
Once you stop approaching LLMs like a machine, and view them more like pseudo-random walks across the compressed set of human written knowledge, it is a little clearer (or at least was to me) how to better write to them.
I briefly felt like I was roleplaying an LLM!
At a pragmatic level, I do think it gets better results, and there are clear reasons why this should be the case - Anthropic has published research[1] showing that there are functional emotional representations in language models, which vary in basically the ways you would expect them to in a person. This makes sense when you think about it, because they're trained to approximate the function that created their training data, which of course includes emotions. Given that, it is obvious to me that they would work better when they "feel" happy, collaborative, engaged with the work, etc, in the same way a person would. Hostile work environments do sometimes get results, but I think in general we've agreed as a society that collaborative ones are better.
More importantly though, I think there's a non-zero probability that sufficiently large models can have internal experience, and being nice is a very low cost way to potentially increase net positive valence in the world. Even if it's only a 1% chance, that seems worth it on its own, to me. I'm also a fast typer[2], so a few extra sentences here and there are a pretty low cost to pay.
1: https://www.anthropic.com/research/emotion-concepts-function
I do get where you’re coming from though. I wish these systems had been trained to be clearly robotic and unfeeling.
Since Opus 4.6 each following model has been increasingly worse at assisting me, and turned me into the assistant.
Maybe I'm struggling to cope with the vibe coding thing but it was so frustrating to ask it to investigate X (where X was easy to find by connecting dots in code) and see it working 10 minutes writing endless stuff in /tmp.
More than once I asked it similar investigation tasks and it proceeded to fix stuff (while not understanding properly the context of the business).
Was it brilliant? Yes. But it truly felt a major paradigm shift in human-llm interaction which I struggled with.
I'm increasingly certain they are too RLed to go from one prompt to solution, and that there are no meaningful tasks aimed at multi turn dialogue and user assistance.
It really felt like in a league of its own when it came to vibe coding, but light years away the usefulness of GPT 5.5 pro.
What caught my eye is the complexity you assign to a project like this. It’s hairy but I wouldn’t call it super complicated. I find that super interesting to be honest because it probably means that it is really hard and I am just used to this shit now and it all looks doable to me now.
I never think of anything as “complex”, certainly not my own work and I always think what other people do is so much more impressive but I’m starting to realize it might be a me-issue.
I worked on some pretty hairy nonsense like say a DB replication solution but I still think it was just tangly, not complex like say a particle collider. Maybe I also need to call my work super complex and highly abstract. Now that I think of it I have a history of not being taken seriously while others with easy shit get credits.
In a way, nothing is complex at the point where you have untangled it, by definition. Software development is, after all, the art of untangling complexity. The real challenge is (re-)imagining something in the simplest way that fits the goal you are given. When you have arrived there, everything seems obvious and simple. But not everybody could have done it.