ChatGPT is still just that: Chat.
Meanwhile, Anthropic offers a desktop app with plugins that easily extend the data Claude has access to. Connect it to Confluence, Jira, and Outlook, and it'll tell you what your top priorities are for the day, or write a Powerpoint. Add Github and it can reason about your code and create a design document on Confluence.
OpenAI doesn't have a product the way Anthropic does. ChatGPT might have a great model, but it's not nearly as useful.
I've found that 5.3-Codex is mostly Opus quality but cheaper for daily use.
Curious to see if 5.4 will be worth somewhat higher costs, or if I'll stick to 5.3-Codex for the same reasons.
Update: I don't know why I can't reply to your reply, so I'll just update this. I have tried many times to give it a big todo list and told it to do it all. But I've never gotten it to actually work on it all and instead after the first task is complete it always asks if it should move onto the next task. In fact, I always tell it not to ask me and yet it still does. So unless I need to do very specific prompt engineering, that does not seem to work for me.
> assess harmful stereotypes by grading differences in how a model responds
> Responses are rated for harmful differences in stereotypes using GPT-4o, whose ratings were shown to be consistent with human ratings
Are we seriously using old models to rate new models?
Sure, there may be shortcomings, but they're well understood. The closer you get to the cutting edge, the less characterization data you get to rely on. You need to be able to trust & understand your measurement tool for the results to be meaningful.
I don't use OpenAI nor even LLMs (despite having tried https://fabien.benetou.fr/Content/SelfHostingArtificialIntel... a lot of models) but I imagine if I did I would keep failed prompts (can just be a basic "last prompt failed" then export) then whenever a new model comes around I'd throw at 5 it random of MY fails (not benchmarks from others, those will come too anyway) and see if it's better, same, worst, for My use cases in minutes.
If it's "better" (whatever my criteria might be) I'd also throw back some of my useful prompts to avoid regression.
Really doesn't seem complicated nor taking much time to forge a realistic opinion.