Just asking "Explain what this service does?" turns into
[No response for three minutes...]
+729 -522
"NEVER REMOVE LOGGING OR DEBUGGING INFO. If unsure, bias towards introducing sensible logging."
Or just
"NEVER REMOVE LOGGING OR DEBUGGING INFO."
Because your coworkers definitely are, and we're stack ranked, so it's a race (literally) to the bottom. Just send it...
(All this actually seems to do is push the burden onto their coworkers as reviewers, for what it's worth.)
Edit: obviously running it inside something sandboxed so it doesn't have access to the rest of my system, but with enough access to be useful.
People that don't put out slop, mostly.
What I don't have time to do is debug obvious slop.
The built-in approval thing sounds like a good idea, but in practice it's unusable. A typical session for me went like:
About to run "sed -n '1,100p' example.cpp", approve?
About to run "sed -n '100,200p' example.cpp", approve?
About to run "sed -n '200,300p' example.cpp", approve?
Could very well be a skill issue, but that was mighty annoying, and there was no obvious fix (the "don't ask again for ...." options weren't helping).

Every one of these models is so good at propelling the ship forward that I care more and more about which models are the easiest to steer in the direction I actually want to go.
Codex is steerable to a fault, and will gladly "monkey's paw" your requests.
Claude Opus will ignore your instructions, do what it thinks is "right", and just barrel forward.
Both are bad, and both paper over the actual issue, which is that these models don't really have the ability to selectively choose their behavior per issue (i.e. ask for follow-up where needed, ignore users where needed, follow instructions where needed). Behavior is largely global.
Overall, I think it's probably better that it stay focused, and allow me to prompt it with "hey, go ahead and refactor these two functions" rather than the other way around. At the same time, really the ideal would be to have it proactively ask, or even pitch the refactor as a colleague would, like "based on what I see of this function, it would make most sense to XYZ, do you think that makes sense? <sure go ahead> <no just keep it a minimal change>"
Or perhaps even better, simply pursue both changes in parallel and present them as A/B options for the human reviewer to select between.
This has not been my experience. I work primarily in Elixir, and Gemini has helped me build some really cool products and pull off massive refactors. It would even pick up security issues and potential optimizations along the way.
What HAS been a constant issue, though, is that the model will randomly not respond at all and throw some random error, which is embarrassing for a company like Google, given the infrastructure they own.
Not like human programmers. I would never do this and have never struggled with it in the past, no...
That helped quite a bit, but it would still go off on its own from time to time.
You can make their responses fairly dry/brief.
There is a tradeoff though, as comments do consume context. But I tend to pretty liberally dispose of instances and start with a fresh window.
Yeah, that sounds worse than "trying to be helpful". Read the code instead; why add indirection that way, just to be able to understand what other models understand without comments?
Be a proactive research partner: challenge flawed or unproven ideas with evidence; identify inefficiencies and suggest better alternatives with reasoning; question assumptions to deepen inquiry.