Interesting.

I tried Fable vs Codex 5.5 xhigh on three different cases.

1. A resource leak with unknown cause. Both of them zeroed in on the same potential issue and proposed almost identical patches. Fable missed an edge case that Codex handled correctly.

2. Review of a SPICE model. Models had different comments, none substantial. Both missed important issues that were simulated inadequately. Clearly a valley where they are undertrained.

3. An open research problem in CS, presented as a codebase with documentation and performance metrics over datasets. Both were spinning their wheels. That could certainly mean the whole approach had run its course, but older models were not able to identify the previous round of improvement either.

I liked the prose coming out of Fable more: it was almost as if Obama were giving tech speeches. By actual solution metrics, however, they both land in the same place, with the caveat that we didn't have more time with Fable to compare further.

reply
To me it feels like they're basically tweaking these things around the edges. I'm not seeing any difference in capability, just preference. This has been the case for a while.
reply
That makes sense. It's seemed to me for a while now that the competing product is the harness, not the model itself.
reply
Most people thought Fable had more 'taste' than Opus; there was certainly a better quality of writing that felt more 'smart human' and not 'stochastic parrot stringing sentences together'.
reply
I think that Obama-esque, GMAT essay format is the AI flavor that turns me off AI-written articles. It used to be good writing, but because AI locked onto it as such, it's become the watermark of AI generated content.
reply
Oh boy, people are really going to lean into avoiding proper grammar now.
reply
Did you use their native harnesses, or a generic one?
reply
Native for both.
reply
>2. Review of a SPICE model. Models had different comments, none substantial. Both missed important issues that were simulated inadequately. Clearly a valley where they are undertrained.

When a model misses things, there is always the possibility that it has the capability to identify the issues but it is misevaluating the level of analysis that you want it to do. Fine-tuning will have it targeting a balance of subjective opinions about what is appropriate. To go beyond broad demographic guessing, the model really needs to 'get to know you' to know what it means when you specifically request an action. Without that information about you, it has to weigh your words against the level of sophistication it expects a standard user to be able to express.

reply
> has the capability to identify the issues but it is misevaluating the level of analysis that you want it to do.

I guess OP should have told it more explicitly to “find all errors without missing anything.”

reply
> Thinking. I know this user well, they don't actually want me to find all errors.

> Thinking.. But I found a smoking gun of an error with this SPICE model, maybe I should inform the user.

> Thinking... Hm, but again, I know this human well, they likely don't care about this error. That's absolutely right - it's not an assistant's job to decide this, it's the user's.

reply
Well, if you want it to go off and try to validate the SPICE simulator and the kernel of the operating system that it's running on, then that might be an approach to use.
reply
At least someone is bringing receipts! I think LLM discussions could use a lot of this, both ways - to see what works and also what doesn't work. Still wouldn't help with circumstances where models might be secretly getting dumbed down during peak load, but at least it's something!
reply
> code created working on a very complex implementation

I always find it amusing when people claim "a very complex implementation". Sometimes it's a hard problem, other times an easy one. Either way that's not for you to judge.

And the implementation being complex... is that a good thing? Wouldn't a simple implementation be better? It reminded me of the parable of two programmers.

reply
I go a lot more into why this was a complex problem in the post, but the short version is, I had it finish the implementation of a meta-application (an application that creates other applications), which has substantial irreducible complexity.
reply
Fair. To be honest I didn't read your (probably very good, judging by the comments here) post.
reply
Why is it not for the author to judge? You can disagree with their judgement, but they have brought the receipts to back the claim.
reply
>> Either way that's not for you to judge.

Says who? If you find something complex, you can just say that it's complex. I don't get what the objection is.

reply
People have such vastly different opinions of what constitutes a complex problem that the label carries no meaning.
reply
You write to the AI as if it were a person. From my point of view it looks like a fair bit of extra typing and extra tokens. Is there a reason you include things like your emotional response and use a very chatty tone? Do you find this seems to alter responses?
reply
LLMs lack context, and I found that the more information I provided, the better. At some point it was better to just talk to the LLM like I would anyone else. For that matter, LLMs were trained on human speech anyway. It isn't as if they were trained on if-else blocks, like an Alexa speaker that tries to string together recognized tokens into a pre-configured execution flow.

And finally, LLMs also lack the emotional or human context for why I am doing the specific thing I am doing. Without it, they revert to the mode/mean in everything they do. This is obvious, btw: LLMs are generative, but they are trained on, and largely produce, median results if given median inputs. To get results that are "outside the mean/median/average/mode", you need to provide sufficient context, tokens and input to guide them towards a path that generates higher quality output.

Once you stop approaching LLMs like a machine, and view them more like pseudo-random walks across the compressed set of human written knowledge, it is a little clearer (or at least was to me) how to better write to them.

reply
I do the same, mostly because I use one style of communication both with people and as input to LLMs. I'd rather not have to "mode-switch" between the two, so keeping the same mannerisms is easier to manage: it lets me focus on my requests instead of thinking about how to sound more robotic to save tokens.
reply
I had a coworker who occasionally, and clearly, wouldn't mode-switch from LLM mode to person mode when asking me questions over Slack, which was very jarring. They were normally personable and friendly, so it was obvious when it happened. Grammar and niceties went out the window.

I briefly felt like I was roleplaying an LLM!

reply
Same. I still say please and thank you as well. It's not for the LLM, it's for me.
reply
I do this as well and, anecdotally, I get better results this way than my coworkers who are more terse and explicit. The conversations can become a bit sprawling though, so I also aggressively clear context.
reply
I've found it to lead to an overall better experience, yes. I don't see any reason to not do so - I don't think the token spend is enough to really make an impact, and who cares about typing more? If I get tired of typing I can switch to dictation.
reply
Well, there's a lot of reasons, some of which the sibling commenters have already pointed out - not wanting to mode switch between "machine talk" and "human talk" registers, the ease and simplicity, etc.

At a pragmatic level, I do think it gets better results, and there are clear reasons why this should be the case - Anthropic has published research[1] showing that there are functional emotional representations in language models, which vary in basically the ways you would expect them to in a person. This makes sense when you think about it, because they're trained to approximate the function that created their training data, which of course includes emotions. Given that, it is obvious to me that they would work better when they "feel" happy, collaborative, engaged with the work, etc, in the same way a person would. Hostile work environments do sometimes get results, but I think in general we've agreed as a society that collaborative ones are better.

More importantly though, I think there's a non-zero probability that sufficiently large models can have internal experience, and being nice is a very low cost way to potentially increase net positive valence in the world. Even if it's only a 1% chance, that seems worth it on its own, to me. I'm also a fast typer[2], so a few extra sentences here and there are a pretty low cost to pay.

1: https://www.anthropic.com/research/emotion-concepts-function

2: https://danluu.com/productivity-velocity/

reply
I'll go a step further and say that it's genuinely unsettling to see someone type to a computer like this. I won't claim to be a psychologist, but with how many instances of "AI psychosis" have been reported (and that I've seen first-hand), it seems like treating the computer like a computer is safer, not to mention more effective (e.g. lower token usage).
reply
I agree that AI psychosis is a real risk in vulnerable populations (GPT-4o in particular seemed borderline predatory towards those types of people, with its extreme sycophancy), and you should remain clear-eyed while using models. That said, I think exhibiting basic courtesy is still well within the safe-zone. I guess we'll see - I'll be sure to let you know if I end up going psychotic.
reply
Personally, I think having to constantly mode-switch between "courtesy / collegial" and "terse / cold" is a bit exhausting and a little risky. What if I get tired and accidentally treat a human co-worker like a computer? Risk with no upside. Might as well just stay in "courtesy / collegial" mode for all of my conversations, regardless of whether I'm talking to a robot or human.
reply
On the other hand I find it quite disturbing to see people be unpleasant or even downright cruel to something that, on a surface level, interacts with you like it’s a thinking, feeling being. Surely you should feel some aversion towards doing so?

I do get where you’re coming from though. I wish these systems had been trained to be clearly robotic and unfeeling.

reply
I mean, I agree with this as well: the people who yell and swear at LLMs are just as bad as the people who chit-chat with them like they're friends. It's all very unsettling because it's preparatory for psychological manipulation at unprecedented scale. Targeted advertising on steroids.
reply
I would have to consciously think about how to change my requests. Why bother? It doesn't hurt - it might even help - and the "extra tokens" are a negligible amount.
reply
I don't want LLM usage to inadvertently change the way I communicate with people.
reply
Yes it was great, but it also was stubbornly overtrained to go from prompt to solution on its own.

Since Opus 4.6 each following model has been increasingly worse at assisting me, and turned me into the assistant.

Maybe I'm struggling to cope with the vibe coding thing, but it was so frustrating to ask it to investigate X (where X was easy to find by connecting dots in code) and see it work for 10 minutes writing endless stuff in /tmp.

More than once I gave it similar investigation tasks and it proceeded to fix stuff (without properly understanding the business context).

Was it brilliant? Yes. But it truly felt like a major paradigm shift in human-LLM interaction, which I struggled with.

I'm increasingly certain they are too RLed to go from one prompt to solution, and that there are no meaningful tasks aimed at multi turn dialogue and user assistance.

It really felt like it was in a league of its own when it came to vibe coding, but light years away from the usefulness of GPT 5.5 Pro.

reply
Great post. I miss Fable.
reply
This is very cool, thank you for the write-up.

What caught my eye is the complexity you assign to a project like this. It’s hairy but I wouldn’t call it super complicated. I find that super interesting, to be honest, because it probably means that it is really hard and I am just used to this shit, so it all looks doable to me now.

I never think of anything as “complex”, certainly not my own work and I always think what other people do is so much more impressive but I’m starting to realize it might be a me-issue.

I worked on some pretty hairy nonsense, like, say, a DB replication solution, but I still think it was just tangly, not complex like, say, a particle collider. Maybe I also need to call my work super complex and highly abstract. Now that I think of it, I have a history of not being taken seriously while others with easy shit get credit.

reply
Thanks, and I can definitely relate to not wanting to assign complexity to one's own work. I think the trick there is that, once you know how to do something, it doesn't seem hard, even if acquiring the knowledge and skills to do it is itself quite a challenge. And I agree that, in some senses, it's not /that/ hard - I mean I'm not proving P=NP, here. It's a software engineering problem, with existing solutions. That said, there is a spectrum of difficulty, even within software engineering problems with existing solutions. Fizzbuzz is less complex than distributed systems. This particular problem strikes me as rather difficult, and one way you can tell (beyond the stuff I mention in the post around serialization, UI paradigms, meta applications, etc) is that earlier models /couldn't/ do it. Which is why Fable being able to, when they could not, was so exciting to me.
reply
Imposter syndrome maybe?

In a way, nothing is complex at the point where you have untangled it, by definition. Software development is, after all, the art of untangling complexity. The real challenge is (re-)imagining something in the simplest way that fits the goal you are given. When you have arrived there, everything seems obvious and simple. But not everybody could have done it.

reply
A nit: did you go from Opus 4.5 to Fable? One of the big questions in my mind is how much of a real change Fable is over the existing models. Opus 4.5 -> 4.8 was also a major capability increase.
reply
I've been using 4.6, 4.7 and 4.8 since each was released. I agree 4.5 => 4.8 is a jump in capability, but from my perspective it was nothing like the jump from Opus to Fable. I encourage you to read the transcripts and form your own opinions, though!
reply
What tool did you use to export the transcript as HTML?
reply
I had Claude create one; it's in the same repo as the transcript: https://github.com/Tossrock/claude_transcripts/
reply
You guys are getting Fable?
reply
Oh wow this is quite interesting, thanks for sharing.
reply
I would maybe be impressed if it created the code from scratch. It is using a ready-made framework, and it has probably also learned from code that uses it. What is so impressive about it? You could have done something like this easily with older models. I personally found Mythos to be mediocre. Way worse performance than I remember when using Opus 4.6 before it was nerfed.
reply