I must be missing something or supremely lucky because I feel like I’ve never hit these “stupid” moments.
If I do, it’s probably because I forgot to switch off of haiku for some tiny side thing I was doing before going back to planning.
But even when Opus is running healthy, it still doesn't address the underlying issue that these models can only do so much. I have had Opus build out a bunch of apps, but I still find my time absorbed as soon as anything genuinely exceeds "CRUD-level difficulty". Asking it to fix a subtle visual alignment issue, make a small change to a completely novel algorithm, or just fix a tiny bug without having to watch for "Oh, this means I should rewrite module <X>" simply isn't possible while still being able to stand over the work.
That's not to say I don't get a massive benefit from these tools. I just think it's possible to be asking too much of them, and that's maybe the real problem to solve.
2 weeks ago, I had only hit my limit a single time and that was when I had multiple agents doing codebase audits.
They didn’t do a great job of explaining it. I wonder how many people got used to the 2X limits and now think Anthropic has done something bad by going back to normal.
That's EXACTLY and ALL I've been doing!
Using Codex and Claude both side by side to view my Godot components framework open source project (link in profile)
Claude has been..ugh.. bad, to put it mildly, on the same content and the same prompts.
For JSON-to-text formatting it works well on a one-round basis. So realistically you should have an evaluation ready to go that you can run against these models. I currently judge them myself, but people often use a smart LLM as the judge.
Today, writing an eval harness with Claude is a five-minute job. Build one yourself so you can keep re-running it as the quants of Gemma get better.
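To make that concrete, here is a minimal sketch of such a harness. The `model_format` function is a hypothetical stand-in for whatever model call you're evaluating (a Claude API call, a local Gemma quant, etc.), and the key-phrase scoring is just one simple alternative to the LLM-as-judge approach mentioned above:

```python
# Hypothetical stand-in for the model under test -- swap in your
# actual client call (Claude, a local Gemma quant, etc.).
def model_format(record: dict) -> str:
    # Dummy implementation so the harness runs end to end.
    return f"{record['name']} is {record['age']} years old."

# Each case: a JSON-like input plus key phrases the output must contain.
CASES = [
    ({"name": "Ada", "age": 36}, ["Ada", "36"]),
    ({"name": "Alan", "age": 41}, ["Alan", "41"]),
]

def run_eval(format_fn) -> float:
    """Return the fraction of cases whose output contains every key phrase."""
    passed = 0
    for record, must_contain in CASES:
        output = format_fn(record)
        if all(phrase in output for phrase in must_contain):
            passed += 1
    return passed / len(CASES)

if __name__ == "__main__":
    print(f"score: {run_eval(model_format):.2f}")
```

Once this exists, comparing a new quant is just swapping the function passed to `run_eval` and diffing the scores; replacing the key-phrase check with an LLM judge only changes the body of `run_eval`.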
On top of that, their $20 plan has much higher usage limits than Anthropic's $20 plan, and they allow its use in e.g. opencode. So you can set up opencode to use both OpenAI's Codex plan plus one of the more intelligent Chinese models to maximize your usage: have it fully plan things out using GPT 5.4, write code using e.g. Qwen 3.6, then switch back to GPT 5.4 for review.
That is, I get more variance between Opus 4.6 and itself than I do between the SOTA models.
I don’t have the budget for statistical significance, but I’m convinced people claiming broad differences are just vibing, or there are times when agent features make a big difference.
Oh no, there's plenty of us willing to say we told you so.
What's more interesting to me is what it's going to look like if big companies start removing "AI usage" from their performance metrics and cease compelling us to use it. More than anything else, that's been the dumbest thing to happen with this whole craze.
Would really love some path forward where the AI parts only poke out as single fields in traditional user interfaces and we can forget this whole episode
My primary interest is using small edge models to perform specific engineering tasks. In this pursuit I do like to use gemini-cli or Antigravity with Claude a few times a week as coding assistants, but I am using relatively few tokens to do this.
I also waste a lot of time, but this is fun time: experimenting with open source coding agents with local models just to see what kinds of results I can get. This is mostly a waste of time, but I enjoy it.
My other favorite use pattern: once or twice a week I like to use the iOS Gemini app in voice mode, and once a month also use video input. I really like this, but it is not life changing.
Externalities matter: I never use frontier LLM-based AI without thinking of energy, data center, and environmental costs.
And video calling did take off: plenty of people use FaceTime, and almost everybody working in an office uses some form of video calls. Criticizing the early attempts at getting video calling working because they hadn't taken off yet misses the point (I remember them being advertised on "video phones" with 56k modems); of course someone was going to have the idea and implement it before it was quite practical.
To help with understanding that perspective, I cannot imagine a scenario where I would ask a device connected to the internet to turn off the lights. I literally never wanted this. A physical switch is 100% non-negotiable for me. I feel the same way about non-mechanical car doors.
Perhaps due to that outlook I was always puzzled by the entire idea of an "assistant". It's interesting for me to see that there are people out there who actually want that "assistant".
Ever end up cooking or something when the phone/doorbell rings and you want to pause the music? Have your hands full and wanted to open a door? Hear the weather and then the news as you brew coffee or put your shoes on (without interaction with a bright screen)?
You should save some money and keep some privacy doing it your way :)
Maybe you're a little strange but it cannot be that much of a stretch for you to consider using speech to ask for things.
Not wanting to hide things behind Internet connected computers is fine, being unable to imagine wanting to use your voice to ask for things is a little silly.
I regret paying Google for a one-year AI subscription last spring, because it has kept me from experimenting with many vendors, even though it was a deep discount over the regular $20/month cost and a fantastic deal financially.
I just put a reminder on my calendar to try OpenCode zen when my subscription ends.
Dealing with these right now with ChatGPT. Bricked a thread which I didn’t even know was possible.
I’m kind of confused by these takes from HN readers. I could see LinkedIn bros getting reality checked when they finally discover that LLMs aren’t magic, but I’m confused about how a developer could go all-in on AI and not immediately realize the limitations of the output.
I'm "all-in" on AI code generation. I very much realise their limitations, it's like any tool really. I do think they're magic, you just need to learn how to weld the power.