undefined

upvote

points

by maccard6 hours ago |

upvote

by krastanov6 hours ago|

[-]

This is fascinating to me. I completely believe you and I will not bother you with all the common "but did you try to tell it this or that" responses, but this is such a different experience from mine. I did the exact same task with claude in the Julia language last week, and everything worked perfectly. I am now in the habit of adding "keep it simple, use only public interfaces, do not use internals, be elegant and extremely minimal in your changes" to all my requests or SKILL.md or AGENTS.md files (because of the occasional failure like the one you described). But generally speaking, such complete failures have been so very rare for me, that it is amazing to see that others have had such a completely different experience.

reply

upvote

by aqua_coder5 hours ago|

[-]

My experience with working with AI agents is that they can be verbose and do things that are too over complicated by default. Unless directed explicitly. Which may be the reason for this discrepancy.

reply

upvote

by jayd165 hours ago|

[-]

You say it doesn't fail but you also mention all these work around you know and try...sounds like it fails a lot but your tolerance is different.

reply

upvote

by jmalicki5 hours ago|

[-]

Most people I've seen complain say things like "I asked it for code and it didn't compile."

The real magic of LLMs comes when they iterate until completion until the code compiles and the test passes, and you don't even bother looking at it until then.

Each step is pretty stupid, but the ability to very quickly doggedly keep at it until success quite often produces great work.

If you don't have linters that are checking for valid syntax and approved coding style, if you don't have tests to ensure the LLM doesn't screw up the code, you don't have good CI, you're going to have a bad time.

LLMs are just like extremely bright but sloppy junior devs - if you think about putting the same guardrails in place for your project you would for that case, things tend to work very well - you're giving the LLM a chance to check its work and self correct.

It's the agentic loop that makes it work, not the single-shot output of an LLM.

reply

upvote

by alecbz4 hours ago|

[-]

Stuff like this works for things that can be verified programmatically (though I find LLMs still do occasionally ignore instructions like this), but ensuring correct functionality and sensible code organization are bigger challenges.

There are techniques that can help deal with this but none of them work perfectly, and most of the time some direct oversight from me is required. And this really clips the potential productivity gains, because in order to effectively provide oversight you need to page in all the context of what's going on and how it ought to work, which is most of what the LLMs are in-theory helping you with.

LLMs are still very useful for certain tasks (bootstrapping in new unfamiliar domains, tedious plumbing or test fixture code), but the massive productivity gains people are claiming or alluding to still feel out of reach.

reply

upvote

by jmalicki4 hours ago|

[-]

It depends - there are some very very difficult things that can still be easily verifiable!

For instance, if you are working on a compiler and have a huge test database of code to compile that all has tests itself, "all sample code must compile and pass tests, ensuring your new optimizer code gets adequate branch coverage in the process" - the underlying task can be very difficult, but you have large amounts of test coverage that have a very good chance at catching errors there.

At the very least "LLM code compiles, and is formatted and documented according to lint rules" is pretty basic. If people are saying LLM code doesn't compile, then yes, you are using it very incorrectly, as you're not even beginning to engage the agentic loop at all, as compiling is the simplest step.

Sure, a lot of more complex cases require oversight or don't work.

But "the code didn't compile" is definitely in "you're holding it wrong" territority, and it's not even subtle.

reply

upvote

by alecbz3 hours ago|

[-]

Yeah performance optimization is potentially another good area for LLMs to shine, if you already have a sufficiently comprehensive test suite, because no functionality is changing. But if functionality is changing, you need to be in the loop to, at the very least, review the tests that the LLM outputs. Sometimes that's easier than reviewing the code itself, but other times I think it requires similar levels of context.

But honestly I think sane code organization is the bigger hurdle, which is a lot harder to get right without manual oversight. Which of course leads to the temptation to give up on reviewing the code and just trusting whatever the LLM outputs. But I'm skeptical this is a viable approach. LLMs, like human devs, seem to need reasonably well-organized code to be able to work in a codebase, but I think the code they output often falls short of this standard.

(But yes agree that getting the LLM to iterate until CI passes is table-stakes.)

reply

upvote

by jmalicki3 hours ago|

[-]

Strongly agreed!

I think getting good code organization out of an LLM is one of the subtler things - I've learned quite a bit about what sort of things need to be specified, realizing that the LLM isn't actively learning my preferences particularly well, so there are some things about code organization I just have to be explicit about.

Which is more work, but less work than just writing the code myself to begin with.

reply

upvote

by zorak8me5 hours ago|

[-]

Providing instruction and context doesn’t seem like a “workaround”.

reply

upvote

by plagiarist5 hours ago|

[-]

IME it does fail pretty hard at first. One has to build up a little library of markdown and internal library of prompt techniques. Then it starts working okay. I agree there is a hurdle still, trying it on one task doesn't really get one over the hurdle.

reply

upvote

by shafyy6 hours ago|

[-]

It's almost like.... LLMs are non-deterministic and hallucinating... Oh wait?!

reply

upvote

by krastanov4 hours ago|

[-]

non-deterministic does not mean it is not biased towards a particular type of results (helpful results)

reply

upvote

by linsomniac4 hours ago|

[-]

My friend, with all due respect, I don't think this is a problem with the AI.

I don't know anything about DotNet, but I just fired up Claude Code in an empty directory and asked it to create an example dotnet program using the Tomlyn library, it chugged away and ~5 minutes later I did a "grep Deserialize *" in the project and it came up with exactly the line you wanted (in your comment here) it to produce: var model = TomlSerializer.Deserialize<TomlTable>(tomlContent)!;

The full results of what it produced are at https://github.com/linsomniac/tomlynexample

That includes the prompt I used, which is:

Please create a dotnet sample program that uses the library at https://github.com/xoofx/Tomlyn to parse the TOML file given on the command line. Please only use the Tomlyn library for parsing the TOML file. I don't have any dotnet tooling installed on my system, please let me know what is needed to compile this example when we get there. Please use an agent team consisting of a dotnet expert, a qa expert, a TOML expert a devils advocate and a dotnet on Linux expert.

I can't really comment on the code it produced (as I said, I don't use dotnet, I had to install dotnet on my system to try this), so I can't comment on the approach. 346 lines in Program.cs seems like a lot for an example TOML program, but I know Claude Code tends to do full error checking, etc, and it seems to have a lot of "pretty printing" code.

reply

upvote

by fxtentacle5 hours ago|

[-]

Same here. While LLMs sometimes work surprisingly well, I also encounter edge cases where they fail surprisingly badly multiple times per day. My guess is that other people maybe just don't bother to check what the AI says which would cause them to not notify omission errors.

Like when I was trying to find a physical store again with ChatGPT Pro 5.4 and asked it to prepare a list of candidates, but the shop just wasn't in the list, despite GPT claiming it to be exhaustive. When I then found it manually and asked GPT for advice on how I could improve my prompting in the future, it went full "aggressively agreeable" on me with "Excellent question! Now I can see exactly why my searches missed XY - this is a perfect learning opportunity. Here's what went wrong and what was missing: ..." and then 4 sections with 4 subsections each.

It's great to see the AI reflect on how it failed. But it's also kind of painful if you know that it'll forget all of this the moment the text is sent to me and that it will never ever learn from this mistake and do better in the future.

reply

upvote

by jmalicki4 hours ago|

[-]

"I also encounter edge cases where they fail surprisingly badly multiple times per day. "

If 80% of the time they 10x my output, and the other 20% I can say "well they failed, I guess this one I have to do manually" - that's still an absolutely massive productivity boost.

reply

upvote

by maccard3 hours ago|

[-]

But they don’t 10x my output - they write some code for a problem I/you have already thought about. The hard part isn’t writing the code, it never has been. It’s always been solving and breaking down the problem.

reply

upvote

by jmalicki3 hours ago|

[-]

It's not the hard part, just the tedious part that takes the most time.

reply

upvote

by senordevnyc3 hours ago|

[-]

Like when I was trying to find a physical store again with ChatGPT Pro 5.4 and asked it to prepare a list of candidates

I wonder if it was getting blocked on searches or something, and just didn't tell you.

reply

upvote

by pdntspa4 hours ago|

[-]

[dead]

reply

upvote

by ffsm84 hours ago|

[-]

> This morning I asked Claude to use a library to load a toml file in .net and print a value.

Legit this morning Claude was essentially unusable for me

I could explicitly state things it should adjust and it wouldn't do it.

Not even after specifying again, reverting everything eventually and reprompt from the beginning etc. Even super trivial frontend things like "extract [code] into a separate component"

After 30 minutes of that I relented and went on to read a book. After lunch I tried again and it's intelligence was back to normal

It's so uncanny to experience how much ita performance changes - I strongly suspect anthropic is doing something whenever it's intelligence drops so much, esp. Because it's always temporary - but repeatable across sessions if occuring... Until it's normal again

But ultimately just speculation, I'm just a user after all

reply

upvote

by krackers1 hours ago|

[-]

Here's an evil business idea: Use the LLMs to identify the users most likely to be "vocal influencers" and then prioritize resources for them, ensuring they get the best experience. You can engineer a bubble this way.

And then the next step is to dynamically vary resources based on prediction of user stickiness. User is frustrated and thinking of trying competitor -> allocate full resources. User is profiled as prone to gambling and will tolerate intermittent rewards -> can safely forward requests to gimped models. User is an resolute AI skeptic and unlikely to ever preach the gospels of vibecoding -> no need to waste resources on him.

reply

upvote

by maccard3 hours ago|

[-]

> Legit this morning Claude was essentially unusable for me I could explicitly state things it should adjust and it wouldn't do it.

Honestly, this is my experience. Every now and again it just completely self implodes and gives up, and I’m left to pick up the pieces. Look at the other replies who are making sure I’m using the agrntic loop/correct model/specific enough prompt - I don’t know what they’re doing but I would love to try the tools they’re using.

reply

upvote

by pton_xd31 minutes ago|

[-]

Last year I had a friend chatting with me about how Claude had rather quickly transformed their small coding shop, except that they noticed after 3pm it consistently became incredibly dumb. I kind of laughed at the time but you know what, who knows. There's very likely some load-balancing shenanigans going on behind the scenes.

reply

upvote

by Ancalagon19 minutes ago|

[-]

I had a similar experience between this weekend and last weekend!

Maybe Anthropic is trying to cut costs a little and we are all just gaslighting ourselves into thinking its our problem.

reply

upvote

by DougN76 hours ago|

[-]

I have similar experiences. It has worked about half the time, but the code has to be pretty simple. I’ve many experiences, where we work on something complicated for an hour, and it is good compiling code, but then an edge case comes to mind that I ask about and Claude tells me the whole approach is doomed and will never work for that case. It has even apologized a few times for misleading me :) I feel like it’s this weird mix of brilliant moron. But yeah, ask for a simple HTML page with a few fields and it rocks.

reply

upvote

by foobarchu5 hours ago|

[-]

The best experiences I have are those where I can describe what I want done with details. Rather than asking it add toml parsing, I would tell it to exactly which library to use ahead of time and reduce the number of decisions available to the model to make. Some of the most effective use-cases are when you have a reference to give it, e.g. "add x feature the same way as in this other project that is also in the workspace", or "make the changes I made to the contents of directory X in git commit <sha here>, but applied to directory Y instead". In both cases it's a lot of copy/paste then tweaking an obvious value (like replacing "dev" with "QA" everywhere).

I try to give the model as little freedom as possible. That usually means it's not being used for novel work.

reply

upvote

by alecbz4 hours ago|

[-]

> The best experiences I have are those where I can describe what I want done with details.

But that's the hard part! You can only eke out moderate productivity gains by automating the tedium of actually writing out the code, because it's a small fraction of software engineering.

reply

upvote

by senordevnyc3 hours ago|

[-]

I basically never have agents do work without using plan mode first (I use Cursor). So I'll start with a high-level like: "I want to plan out the architecture for feature X, I think there's a linear project already that we should update once our plan is done, but if not we'll need to create one. Here's a picture of my whiteboard, I think I also wrote some notes in Obsidian a few weeks ago that you should look up. Two things off the top of my head to think about are: should we do X or does it make more sense to do Y? And also, I'm worried about how this might interfere with the blahblah system, so can you do some research around that? <ramble ramble>"

Then it crawls around for awhile, does some web searches, fetches docs from here and there, whatever. Sometimes it'll ask me some questions. And then it'll finally spit out a plan. I'll read through it and just give it a massive dump of issues, big and small, more questions I have, whatever. (I'll also often be spinning off new planning sessions for pre-work or ancillary tasks that I thought of while reviewing that plan). No structure or anything, just brain dump. Maybe two rounds of that, but usually just one. And then I'll either have it start building, or I'll have it stash in the linear agent so I can kick it off later.

reply

upvote

by JakeStone5 hours ago|

[-]

If Claude ends up grabbing my C# TOML library, in my defense, I wrote it when the TOML format first came out over a dozen years ago, and never did anything more with it Sorry.

reply

upvote

by maccard5 hours ago|

[-]

Nope! https://github.com/xoofx/Tomlyn

The code it needed to write was:

    var model = TomlSerializer.Deserialize<TomlTable>(toml)!;

Which is in the readme of the repo. It could also have generated a class and deserialised into that. Instead it did something else (afraid I don’t have it handy sorry)

reply

upvote

by phromo5 hours ago|

[-]

I don't know why but I find performance on c#/.net be several generations behind. Sometimes right ofc but my general experience is if you pull the generation slot machine in just about any other language it will work better. I regularly do python, typescript, ruby and rust with a better experience. It's even hard to find benchmarks where csharp is included.

reply

upvote

by ball_of_lint4 hours ago|

[-]

Did you give claude access to run the compile step?

I remember having to write code on paper for my CS exams, and they expected it to compile! It was hard but I mostly got there. definitely made a few small mistakes though

reply

upvote

by maccard3 hours ago|

[-]

Yes.

reply

upvote

by 4 hours ago|

[-]

deleted

reply

upvote

by wrs6 hours ago|

[-]

I’m honestly baffled by this. I don’t want to tell you “you’re holding it wrong” but if this is your normal experience there’s something weird happening.

Friday afternoon I made a new directory and told Claude Code I wanted to make a Go proxy so I could have a request/callback HTTP API for a 3rd party service whose official API is only persistent websocket connections. I had it read the service’s API docs, engage in some back and forth to establish the architecture and library choices, and save out a phased implementation plan in plan mode. It implemented it in four phases with passing tests for each, then did live tests against the service in which it debugged its protocol mistakes using curl. Finally I had it do two rounds of code review with fresh context, and it fixed a race condition and made a few things cleaner. Total time, two hours.

I have noticed some people I work with have more trouble, and my vague intuition is it happens when they give Claude too much autonomy. It works better when you tell it what to do, rather than letting it decide. That can be at a pretty high level, though. Basically reduce the problem to a set of well-established subproblems that it’s familiar with. Same as you’d do with a junior developer, really.

reply

upvote

by thwarted5 hours ago|

[-]

> it happens when they give Claude too much autonomy. It works better when you tell it what to do, rather than letting it decide. That can be at a pretty high level, though. Basically reduce the problem to a set of well-established subproblems that it’s familiar with. Same as you’d do with a junior developer, really.

Equating "junior developers" and "coding LLMs" is pretty lame. You handhold a junior developers so, eventually, you don't have to handhold anymore. The junior developer is expected to learn enough, and be trusted enough, to operate more autonomously. "Junior developers" don't exist solely to do your bidding. It may be valuable to recognize similarities between a first junior developer interaction and a first LLM interaction, but when every LLM interaction requires it to be handheld, the value of the iterative nature of having a junior developer work along side you is not at all equivalent.

reply

upvote

by wrs4 hours ago|

[-]

I didn’t say they are equivalent, nor do I in any way consider them equivalent. One is a tool, the other is a person.

I simply said the description of the problem should be broken down similar to the way you’d do it for a junior developer. As opposed to the way you’d express the problem to a more senior developer who can be trusted to figure out the right way to do it at a higher level.

reply

upvote

by maccard5 hours ago|

[-]

> I have noticed some people I work with have more trouble, and my vague intuition is it happens when they give Claude too much autonomy

What’s giving too much autonomy about

“Please load settings.toml using a library and print out the name key from the application table”? Even if it’s under specified, surely it should at least leave it _compiling_?

I’ve been posting comments like this monthly here, my experience has been consistently this with Claude, opencode, antigravity, cursor, and using gpt/opus/sonnet/gemini models (latest at time of testing). This morning was opus 4.6

reply

upvote

by linsomniac4 hours ago|

[-]

> Even if it’s under specified, surely it should at least leave it _compiling_?

Are you using Claude Code? Do yo have it configured so that you are not allowing it to run the build? Because I've observed that Claude Code is extremely good at making sure the code compiles, because it'll run a compile and address any compile errors as part of the work.

I just asked it to build a TOML example program in DotNet using Tomlyn, and when it was done I was able to run "./bin/Debug/net8.0/dotnettoml example.toml", it had already built it for me (I watched it run the build step as part of its work, as I mentioned it would do above).

reply

upvote

by maccard3 hours ago|

[-]

I am using Claude code. I didn’t explicitly tell it what the build command was (it’s dotnet build), and it didn’t ask. Thats not my fault.

> I’ve observed Claude code is extremely good at making sure the code compiles

My observation is that it’s fine until it’s absolutely not, and the agentic loop fails.

reply

upvote

by linsomniac2 hours ago|

[-]

>Thats not my fault.

I don't know that it's useful to assign blame here.

It probably is to your benefit, if you are a coding professional, to understand why your results are so drastically different from what others are seeing. You started this thread saying "I keep getting told I'll be amazed at what it can do, but the tools keep failing at the first hurdle."

I'm telling you that something is wrong, that is why you are getting poor results. I don't know what is wrong, but I've given you an example prompt and an example output showing that Claude Code is able to produce the exact output you were looking for. This is why a lot of people are saying "you'll be amazed at what it can do", and it points to you having some issue.

I don't know if you are running an ancient version of Claude Code, if you are not using Opus 4.6, you are not using "high" effort (those are what I'm using to get the results I posted elsewhere in reply to your comment), but something is definitely wrong. Some of what may be wrong is that you don't have enough experience with the tooling, which I'd understand if you are getting poor results; you have little (immediate) incentive to get more proficient.

As I said, I was able to tell Claude Code to do something like the example you gave, and it did it and it built, without me asking, and produced a working program on the first try.

reply

upvote

by maccard9 minutes ago|

[-]

> I don’t know that it’s useful to assign blame here

Oh - I’m blaming Claude not anyone else. I’ve tried again this evening and the same prompt (in the same directory on the same project) worked.

> i don’t know if you’re using an ancient version of Claude code,

I’m on a version from some time last week, and using opus 4.6

> This is why a lot of people are saying "you'll be amazed at what it can do", and it points to you having some issue.

If you look at my comments in these threads, I’ve had these issues and been posting about this for months. I’m still being told “ you’re using the wrong model or the wrong tool or you’re holding it wrong” but yet, here I am.

I’m using plan mode, clearly breaking down tasks and this happens to me basically every time I use the damn tool. Speaking to my team at work and friends in other workplaces, I hear the same thing. But yet we’re just using it wrong or doing something wrong,

Honestly, I genuinely think the people who are not having these experiences just… don’t notice that they are.

reply

upvote

by Kiro3 hours ago|

[-]

Not even the worst possible prompt would explain your unusual experience, so I don't think that's it either.

reply

upvote

by wrs4 hours ago|

[-]

There’s nothing wrong with it that I can see. Like I said, I’m a bit baffled at your experience. I will say, it’s not unusual for the initial output not to compile, but usually one short iteration later that’s fixed. Claude Code will usually even do that iteration by itself.

reply

upvote

by maccard3 hours ago|

[-]

> I will say, it’s not unusual for the initial output not to compile,

We’ve gone from “I’m baffled at your experience” to well yeah it often fails” in two sentences here…

reply

upvote

by wrs27 minutes ago|

[-]

Hmm…if you’re giving only one prompt to Claude Code, and allowing it only one output, then I’m no longer baffled at why you’re not getting good results. That’s not how it works. (That’s not how it works when I write code myself, either!)

reply

upvote

by maccard7 minutes ago|

[-]

I mean, I don’t know how much less scope I can give it. The next step is writing the 5 lines of code I want it to write.

I also clearly said I didn’t allow it one output, I gave it the compile error message, it changed a different line, I told it it was at the affected line and to check the docs. Claude code then tried to query the DLL for the function, abandoned that and then did something else incorrect.

I’m literally asking it to install a package and copy the example from the readme

reply

upvote

by saulpw55 minutes ago|

[-]

It's not unusual for my initial output (as a programmer) not to compile either. I wouldn't say I "failed" if I can then get it to compile. Which as people are saying, is what happens with Claude Code and Opus, either automatically or at most when I say "get it to compile".

reply

upvote

by maccard6 minutes ago|

[-]

But when it doesn’t compile for me, I don’t claim it’s finished.

reply

upvote

by shireboy5 hours ago|

[-]

Similar. I regularly use Github copilot (with claude models sometimes) and it works amazingly. But I see some who struggle with them. I have sort of learned to talk to it, understand what it is generating, and routinely use to generate fixes, whole features, etc. much much faster than I could before.

reply

upvote

by senordevnyc3 hours ago|

[-]

This is so perplexing to me. I've definitely hit these kinds of issues (which usually result in me cursing at the agent in all caps while telling it to get its shit together!), but it's almost always a long ways into a session where I know context rot is an issue, and the assumption it's making is a dumb one but it's also in the middle of a complex task...I just haven't had anything remotely like the situation you're describing, where Opus 4.6 can't make a simple change and verify that it compiles, can't look up docs, can't follow your instructions, etc. Bizarre.

reply

upvote

by roncesvalles6 hours ago|

[-]

Did you use the best model available to you (Opus 4.6)? There is a world of difference between using the highest model vs the fast one. The fast ones are basically useless and it's a shame that all these tools default to it.

reply

upvote

by fweimer5 hours ago|

[-]

To make this more confusing, there is a fast mode of Opus 4.6, which (as far as I understand it) is supposed to deliver the same results as the standard mode. It's much more expensive, but advertised as more interactive.

reply

upvote

by moffkalast4 hours ago|

[-]

Opus is not cost effective.

reply

upvote

by 4 hours ago|

[-]

deleted

reply

upvote

by senordevnyc3 hours ago|

[-]

Compared to what? I spent about $700 last month on Cursor, mostly on Opus 4.6. The time it saved me is at least 10-20x more valuable than that. I keep trying to use GPT-5.3 Codex, because it's like 40% cheaper, but then the quality doesn't always seem quite as good, and it's honestly not worth it to me to save a couple hundred bucks a month for something that's 10% worse. The multiplicative value of what I'm building means it's worth it to pay extra for the SOTA model in almost all cases.

reply