It's a huge jump across the board. I was really impressed with its ability to test usability in Claude for Chrome. Very opinionated but in a good way. It was good while it lasted.
reply
Wow unsure why you are getting downvoted. It’s just odd. I just don’t get the skepticism towards this model. It’s released and it’s amazing. The hype was real and I can see why the researchers were anxious about releasing it.
reply
I did not see that?

It's way more _proactive_ than the old models, sometimes in ways it shouldn't really be proactive. But it produces _more_ slop than 4.8, and I have not seen any real breakthroughs from it.

Edit: to give an example, I'm working on integrating a self-hosted auth provider into our app. So I gave it a prompt to create a "bootstrap" script that would create pre-configured settings for the local installation.

Fable did it. And then it proceeded (unprompted) to test it by killing the running server, removing the database, re-initializing, and (trying to) verify that the bootstrap produced identical results.

Well, yeah. Great. I can see how this "bias for action" works for security research and one-shot projects, not so sure about regular development.

I just tried that with Opus, and it produced a similar bootstrap script but did not start the test by itself.

reply
Ah, that I will admit. It gets shit done one way or another haha. This is why a sandboxed environment and a reproducible test DB are key here. I give my Claude read-only access to my dev DB, which really removes the temptation it increasingly has to “cheat”, e.g. doing something hacky and fixing the DB manually in a way that doesn’t solve the problem everywhere.
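
To make the read-only idea concrete, here is a minimal sketch. It assumes SQLite purely for illustration (the DB engine isn't stated above), and the `dev.db` file, `users` table, and helper names are all made up. With a server database the rough equivalent would be a role that only has SELECT grants.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_read_only(db_path: Path) -> sqlite3.Connection:
    """Open an SQLite database so that any write attempt fails."""
    # mode=ro is SQLite's URI parameter for a read-only connection.
    return sqlite3.connect(f"{db_path.resolve().as_uri()}?mode=ro", uri=True)


def demo():
    """Create a throwaway DB, then show reads work and writes are rejected."""
    with tempfile.TemporaryDirectory() as tmp:
        db_path = Path(tmp) / "dev.db"  # hypothetical dev database

        # Seed the database through a normal read-write connection.
        seed = sqlite3.connect(db_path)
        seed.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
        seed.execute("INSERT INTO users (name) VALUES ('alice')")
        seed.commit()
        seed.close()

        # This is the only kind of connection the agent would be handed.
        ro = open_read_only(db_path)
        try:
            rows = ro.execute("SELECT id, name FROM users").fetchall()
            try:
                # The "hacky manual fix" path: patch the data directly.
                ro.execute("UPDATE users SET name = 'patched' WHERE id = 1")
                write_blocked = False
            except sqlite3.OperationalError:
                # SQLite refuses writes on a read-only connection.
                write_blocked = True
        finally:
            ro.close()
        return rows, write_blocked
```

The point is that the manual patch fails at the connection level, so the only way forward is fixing the code that produces the data.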

Personally I love when the AI has this amount of problem solving. But you have to build the environment around it that encourages solving problems right the first time, versus taking the easy way out and hacking out a solution.

It’s just all about constraining the behavior of the LLM into productive and permanent directions. The more advanced it gets, the more it feels like designing engineering processes rather than coding. Personally it’s a fun change of pace, and it’s giving me a lot of opportunities to look at the project I’m working on through a wider lens. I find having to pump out features makes you myopic in a sense. I really miss the control I had over writing it all by hand, but I love just being able to build software. At the end of the day, what do you want? That’s the question I’ve had to grapple with recently.

Personally I don’t mind switching gears to the bigger picture of why the software exists and what purpose it serves.

reply
This honestly sounds like a tweaked system prompt more than anything. Maybe it is an attempt to make the model appear stronger?
reply