
Wow... Our experiences have been very different, then. I've found each upgrade of Opus to be a noticeable improvement in its complex reasoning and delegation capabilities over its predecessor.

To me, this feels in many ways like a technical manager or team lead's job, where I guide the process along using my knowledge and experience, and then let the agent fill in the rest (to the best of its ability).

The agent can't really learn from its mistakes (at least, not without consuming precious context), so I apply a blameless postmortem process, updating the guardrails whenever it goes astray in the same way more than once.

And really, I'd rather be contemplating the more difficult and interesting questions of architecture, environment, ergonomics and market fit, so it suits me fine.

by mwigdahl 4 hours ago


Same here. The power upgrade going to Fable in particular is quite impressive.

by epolanski 4 hours ago


> Wow... Our experiences have been very different, then. I've found each upgrade of Opus to be a noticeable improvement in its complex reasoning and delegation capabilities over its predecessor.

I never said it isn't more capable or more "intelligent"; quite the opposite.

I will try to expand on what I mean.

LLMs' "character/persona/tendencies" are increasingly less about acting as an assistant and more about finding the solution themselves.

I use AI in a specific way: it assists, investigates and answers my questions. I do the coding. It is increasingly difficult to use it as such, because it quickly jumps into giving me solutions instead of answering my specific questions.

I'll give you a few examples.

I asked it to investigate DNS handling details for some Phoenix emailer module work. It did very little investigation and jumped into why I should've used magic links instead. Instead of assisting me in my research, it was hardwired to solve the problem (the wrong one, with a very wrong solution).

Today at work, I had a problem with batching. I wanted to understand whether batching was even needed at all for our use case, and it kept circling around how to fix the batching bug instead. That's not what I asked it to do, yet it jumped to the "solution".

I am increasingly frustrated by these models' "personality" and tendencies, which are geared less toward assisting me with the task at hand and more toward the model doing the task while I merely assist/supervise.

Sure, very detailed prompting on how it has to act helps, but wait a few turns and it drifts back to its default solution-vomiting state.

Which makes me think that these models are hardwired into this mode of operation by consistent training and reinforcement of jumping from prompt to code solution.

by kstenerud 3 hours ago


Ah yes, the agents by default are very "implementation" oriented, which is why I instruct mine to never implement something without formulating a plan first for me to approve.

Another thing they tend to do is rely on their own context -> memories -> training data. And if that's wrong then they'll continue with it until you instruct them to research, after which they usually get the right answer.

I've noticed that the newer models keep track of what you type so as to anticipate what you're likely to say. For example, today Opus 4.8 said "You usually don't want me to commit until you've checked, so the change remains uncommitted."

by taeric 4 hours ago


I think this is just a misunderstanding of how most technology has always worked?

Consider what is happening on most construction sites. The heavy work is absolutely done by the technology on site. But without people there to oversee it and keep it working, it would fail.

And that is almost certainly true at any industrial site. Indeed, look up videos of high-tech looms. A large portion of the technology added to them is there so that the operators can locate a fault and fix it.

by senordevnyc 4 hours ago


> AI should be assisting us, instead it's doing the job and it's us being an assistant to it.

If you're a manager and you ask a report to do something and they come back with a question, does that mean you're now their assistant?

I give agents the tasks, I answer their questions, I make choices about the tradeoffs in their plan, I supervise their implementation, I review their output, I have them walk me through things. In what way is this not delegating to them and managing their work, just like a more junior employee?

by rmunn 4 hours ago


The problem (okay, one of the problems) with renting other people's models is, as you mentioned, that they can and will swap out the model without notifying you ahead of time, and you don't always get to control which model you use. (They might decide to retire it, and you won't be able to get it back if they do.)

Which is why (well, part of why) I think the long-term trend will be towards self-hosting models. Right now the frontier models are far enough ahead of the self-hosted ones that there are lots of people willing to pay by the token to rent someone else's model, because they get more value for money from that than from self-hosting models.

But the frontier companies won't be able to keep up their current levels of expenditure forever. At some point the investors are going to say "Hey, so, um, when am I going to see some return on my investment?" and then the current subsidized subscriptions (including the one my employer uses) are going to go away, much like what happened with Copilot this month.

And then the locally-hosted models are going to suddenly look like a more attractive picture. Because where you might have been willing to spend $100/month/employee to rent time on models in someone else's data center, you might suddenly balk at spending $500/month/employee. You might say "Hey, you know what? A $50,000 up-front capital investment is only, what, one month's worth of subscriptions for our 100 employees? Yeah, okay, I'll approve the hardware purchase. Get that self-hosted model set up and then we'll cancel the subscription and switch over."
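To make that arithmetic concrete, here is a minimal sketch using only the figures above (100 employees, $100 or $500 per employee per month, $50,000 of hardware). The function name and the `monthly_running_cost` parameter (power, maintenance, admin time) are my own illustrative assumptions, and the running cost defaults to zero, which flatters the self-hosting case.

```python
def breakeven_months(hardware_cost, employees, monthly_fee_per_employee,
                     monthly_running_cost=0.0):
    """Months until an up-front hardware purchase is paid back by the
    subscription fees you stop paying. Hypothetical helper for illustration."""
    monthly_savings = employees * monthly_fee_per_employee - monthly_running_cost
    if monthly_savings <= 0:
        # Self-hosting never pays for itself under these assumptions.
        return float("inf")
    return hardware_cost / monthly_savings


# At the hypothetical post-subsidy price of $500/month/employee:
print(breakeven_months(50_000, 100, 500))  # 1.0

# At the $100/month/employee price used as today's example:
print(breakeven_months(50_000, 100, 100))  # 5.0
```

So the same hardware that takes five months to pay for itself at the lower price pays for itself in one month at the higher one, before counting whatever it actually costs to run.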

Not everyone is going to do that. But once the locally-hosted models are good enough, the first few people who do so and report success are going to start a snowball effect. It will likely be driven by money first, but people will slowly discover a side benefit: you can better predict the behavior of the model you're using. It will continue to work the same way next year that it is working this year; or if it doesn't, it's because you chose to install the new version.

And when that happens (I'm saying "when", not "if" because although it might take some time, I think it's inevitable in the long run), the frontier-model rental companies are going to struggle to stay afloat. Except for the ones who saw this coming and transitioned to a non-subscription income source somehow (maybe by selling licenses to self-host their frontier models for $$BIGNUM), or who have some other revenue stream besides renting out models.

by Applejinx 4 hours ago


That sounds weirdly gendered even though there's no reason it should be.

Are you getting LLMsplained? :)

by AnimalMuppet 4 hours ago


Well... as a human software engineer, I've been the one with very strong, intelligent, completely wrong takes. The question is, are the LLMs improving faster than you can improve a junior dev? And is their ceiling as high?