[-]

FWIW, put a copy of the game folder in a directory and tell Claude to extract the game files and disassemble the game in preparation for questions about it.

As an example, here is a session doing this with Jagged Alliance 3 (an RPG): https://pastes.io/jagged-all-69136

Having Claude extract the game archives and disassemble the game leads to far more reliable results than random internet posts.

by jnovek3 hours ago|

[-]

You’re having Claude design builds for you by disassembling the game? Am I understanding that right? I guess I’m thinking too small.

by AnotherGoodName2 hours ago|

[-]

Yes, exactly. Claude can just go in, extract the compressed game archives, decompile, and read the game logic directly to see how everything works. For example, you might be curious how certain stats translate into damage. Just do the above and ask Claude "in detail explain from the decompiled code in this folder for game X how certain stats affect damage and suggest builds to maximise damage taking into account character level <10.".

I've found doing this for games to be far more reliable than trying to find internet posts explaining it. I haven't played POE but if it's anything like any other RPG system Claude will do a great job at this.

by heraldgeezer3 hours ago|

[-]

This will not work for an online game like PoE 2

Or even one with DRM?

Right?

Or?

by AnotherGoodName2 hours ago|

[-]

DRM just stops you launching/connecting to servers if you modified the binary. It does nothing to stop the binary being pulled apart by a bot with no intention of running it.

The place it may fail is obfuscation and server-side logic. But generally client-side logic, especially in a game with a scripting language backing it, is super easy for Claude to pick apart.

by Lord-Jobo5 hours ago|

[-]

Context decay is noticeable within 3 messages, nearly every time. Maybe not substantial, but definitely noticeable.

It’s led to me starting new chats with bigger and bigger starting ‘summary’ prompts to catch the model up while refreshing it. Surely there’s a way to automate that technique.

by AStrangeMorrow4 hours ago|

[-]

Yeah absolutely, at this point I also start new chats after 3-4 prompts. Especially with thinking models that produce so many tokens.

Usually things go smoothly but sometimes I have situations like: “please add feature X, needs to have ABCD.” -> does ABC correct but D wrong -> “here is how to fix D” -> fixes D but breaks AB -> “remember I also want AB this way, you broke it” -> fixes AB but removes C and so on

by nvardakas5 hours ago|

[-]

I've found the same thing. I build with Claude Code daily and the context decay is real by the end of a long session it starts forgetting decisions we made earlier. The 1M context window should help but I'm curious how coherence holds up at that scale.

What's been working for me is keeping a CLAUDE.md file in my project root with key decisions and context. The model reads it at the start of every session so I don't have to re-explain everything. Not as elegant as automated compaction but it works.

by visarga4 hours ago|

[-]

> I build with Claude Code daily and the context decay is real by the end of a long session it starts forgetting decisions we made earlier

I generate task.md files before working on anything, some are short, others are super long and with many steps. The models don't deviate anymore. One trick is to make a post tool use hook to show the first open gate "- [ ]" line from task.md on each tool call. This keeps the agent straight for 100s of gates.

After each gate is executed we don't just check it, we also append a few words of feedback. This makes the task.md become a workbook, covering intent, plan, execution and even judgements. I see it like a programming language now. I can gate any task and the agent will do it, however many steps. It can even generate new gates, or replan itself midway.

You can enforce strict testing policies by just leaning into gate programability power - after each work gate have a test gate, and have judges review testing quality and propose more tests.

The task.md file is like a script or pipeline. It is also like a first class function, it can even ingest other task.md files for regular reflexion. A gate can create or modify gates, or tasks. A task can create or modify gates or tasks.
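A minimal sketch of the gate lookup described above, assuming task.md uses standard Markdown checkboxes. The helper names `first_open_gate` and `close_gate` are hypothetical, and wiring this into a post-tool-use hook is left out because that configuration depends on the agent harness you use:

```python
import re
from typing import Optional

# An open gate is an unchecked Markdown task-list item: "- [ ] ..."
OPEN_GATE = re.compile(r"^\s*- \[ \] (.+)$")


def first_open_gate(task_md: str) -> Optional[str]:
    """Return the text of the first unchecked gate, or None if none remain."""
    for line in task_md.splitlines():
        match = OPEN_GATE.match(line)
        if match:
            return match.group(1).strip()
    return None


def close_gate(task_md: str, feedback: str) -> str:
    """Check off the first open gate and append a short feedback note to it."""
    lines = task_md.splitlines()
    for i, line in enumerate(lines):
        if OPEN_GATE.match(line):
            lines[i] = line.replace("- [ ]", "- [x]", 1) + f" (feedback: {feedback})"
            break
    return "\n".join(lines)


# Hypothetical task file, for illustration only.
example = """# task: add config loader
- [x] write failing test for load_config
- [ ] implement load_config
- [ ] test gate: run the suite and review coverage
"""

print(first_open_gate(example))  # prints: implement load_config
```

A hook script would read task.md from disk and print the result of `first_open_gate` so the agent sees its current gate after every tool call; `close_gate` shows the "check it and append feedback" step that turns the file into a workbook.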

by eric_cc5 hours ago|

[-]

It could also be a skill problem. It would be more helpful if when people made llm sucks claims they shared their prompt.

The people I work with who complain about this type of thing horribly communicate their ask to the llm and expect it to read their minds.

by namr20005 hours ago|

[-]

I don't really understand what you mean by this. The claim is that the same prompt with the same question produces worse results when it's queried in a model that has more than 200k tokens in its context. That doesn't have to do much with the "skillfulness" of using a model.

by AStrangeMorrow4 hours ago|

[-]

Prompt quality does matter, but at some point context size matters too.

I’ve had things like a system that is made up of a collection of procedural systems. I would say “replace the following set of defaults that are passed all around for system X (list of files) and in the managed (file) by a config” and it would do that, but then I’d suddenly see it go “wait, mu and projection distance are also present in systems Y and Z. Let me replace those with a config too, with the same values”, when systems Y and Z use a different set of optimized values and that was clearly outside the scope.

I never had that kind of mistake happen when dealing with small contexts, but with larger contexts (multiple files, long “thinking” sequences) it does happen sometimes.

There were definitely times when I thought “oh well, my bad, I should have clarified NOT to also change that other part”, all the while thinking that no human would have thought to change both.

by trollbridge4 hours ago|

[-]

None of what has been described is a "skill issue". The problem is when an identical prompt produces poor results once the context window exceeds 200k tokens or so.

by alwillis4 hours ago|

[-]

Totally agree that "LLM sucks" posts should be accompanied by the prompt.

by copperx4 hours ago|

[-]

I agree, but at the same time it feels like victim blaming.

by akersten2 hours ago|

[-]

I don't know. Is pointing out that someone holding a drill by the chuck won't get the results they expect that bad?

by staticman23 hours ago|

[-]

Adding web search doesn't necessarily lead to better information at any context.

In my experience the model will assume the web results are the answer even if the search engine returns irrelevant garbage.

For example you ask it a question about New Jersey law and the web results are about New York or about "many states" it'll assume the New York info or "many states" info is about New Jersey.

by blueblisters5 hours ago|

[-]

I think ChatGPT has a huge advantage here. They have been collecting realistic multi-turn conversational data at a much larger scale. And generally their models appear to be more coherent with larger contexts for general purpose stuff.

by gorjusborg5 hours ago|

[-]

The question that comes to mind for me after reading your comment is how can a question about a game require that much context?

by Bombthecat4 hours ago|

[-]

Path of Exile is complex, just check the skill tree, skills and gems :)

It could almost be used as a benchmark of how good models are at math, memory, updated information, etc.

by wouldbecouldbe5 hours ago|

[-]

I feel like a few weeks ago I suddenly had a week where even after 3 messages it forgot what we did. Seems fixed now.

by turbostyler6 hours ago|

[-]

We need an MCP for path of building

by __MatrixMan__4 hours ago|

[-]

Agreed, there's no getting around the "break it into smaller contexts" problem that lies between us and generally useful AI.

It'll remain a human job for quite a while too. Separability is not a property of vector spaces, so modern AIs are not going to be good at it. Maybe we can manage something similar with simplicial complexes instead. Ideally you'd consult the large model once and say:

> show me the small contexts to use here, give me prompts re: their interfaces with their neighbors, and show me which distillations are best suited to those tasks

...and then a network of local models could handle it from there. But the providers have no incentive to go in that direction, so progress will likely be slow.

by reactordev6 hours ago|

[-]

That’s not context decay, that’s training data ambiguity. So much misinformation, nerfs, buffs, changes that an LLM can not keep up given the training time required. Do it for a game that has been stable and it knows its stuff.

by Bombthecat6 hours ago|

[-]

It didn't only give outdated information. In some cases it did, and after two tries telling it to search for updated information it got it right (I shouldn't need to do that, though). But it also gave wrong information about sockets (support skills) that never existed or could never be socketed together in the first place. (OK, maybe in 0.1, but that's what web search is for...) If it can't even handle easily versioned information from a game, how should it handle anything related to time, dates, news, science, etc.?

by reactordev3 hours ago|

[-]

Like any human would, 75% certain with 99% confidence. That’s what you fail to realize. They aren’t “god mode machine”. They are “human-mode” machines and humans make mistakes in thinking just like you do. Some might say asking a powerful LLM for gaming tips is a waste of compute power. Others might say it gives you the knowledge of a new meta emerging. Either way, you both are going to get trained.

by serial_dev5 hours ago|

[-]

Please don’t pop the AI bubble, bro. Stop asking questions, bro. Believe the hype, bro.

by jnovek3 hours ago|

[-]

What were you asking about PoE 2? So far my _general_ experience with asking LLMs about ARPGs has been meh. Except for Diablo 2 but I think that’s just because Diablo 2 has been heavily discussed for ~25 years.

by holoduke4 hours ago|

[-]

The number one thing you always need to set up is feedback loops for Claude, so it's able to shotgun-program its way to a solution.

by MikeNotThePope19 hours ago|

[-]

Is it ever useful to have a context window that full? I try to keep usage under 40%, or about 80k tokens, to avoid what Dex Horthy calls the dumb zone in his research-plan-implement approach. Works well for me so far.

No vibes allowed: https://youtu.be/rmvDxxNubIg?is=adMmmKdVxraYO2yQ
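The 40% rule of thumb above is simple arithmetic. As a rough sketch, assuming the 200k-token window that "40%, or about 80k tokens" implies (the function names and default threshold are illustrative, not part of any tool):

```python
def context_budget(window_tokens: int, max_fraction: float = 0.40) -> int:
    """Tokens you allow yourself to use before starting a fresh context."""
    return round(window_tokens * max_fraction)


def should_start_fresh(used_tokens: int, window_tokens: int,
                       max_fraction: float = 0.40) -> bool:
    """True once usage reaches the self-imposed budget."""
    return used_tokens >= context_budget(window_tokens, max_fraction)


print(context_budget(200_000))              # 80000
print(context_budget(1_000_000))            # 400000
print(should_start_fresh(95_000, 200_000))  # True
```

Whether the same fraction is the right cutoff for a 1M-token window is exactly the open question in this thread; the sketch only shows what the rule would mean if it carried over unchanged.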

by furyofantares18 hours ago|

[-]

I'd been on Codex for a while and with Codex 5.2 I:

1) No longer found the dumb zone

2) No longer feared compaction

Switching to Opus for stupid political reasons, I still have not had the dumb zone - but I'm back to disliking compaction events and so the smaller context window it has, has really hurt.

I hope they copy OpenAI's compaction magic soon, but I am also very excited to try the longer context window.

by pjerem12 hours ago|

[-]

If you use OpenCode (open source Claude Code implementation), you can configure compaction yourself : https://opencode.ai/docs/en/config/#compaction

by furyofantares8 hours ago|

[-]

OpenAI has some magic they do on their standalone endpoint (/responses/compact) just for compaction, where they keep all the user messages and replace the agent messages or reasoning with embeddings.

> This list includes a special type=compaction item with an opaque encrypted_content item that preserves the model’s latent understanding of the original conversation.

Some prior discussion here https://news.ycombinator.com/item?id=46737630#46739209 regarding an article here https://openai.com/index/unrolling-the-codex-agent-loop/

by comboy9 hours ago|

[-]

Not sure if it's common knowledge, but I learned not that long ago that you can do "/compact your instructions here". If you just say what you are working on, or explicitly what to keep, it's much less painful.

In general LLMs for some reason are really bad at designing prompts for themselves. I tested it heavily on some data where there was a clear optimization function and ability to evaluate the results, and I easily beat opus every time with my chaotic full of typos prompts vs its methodological ones when it is writing instructions for itself or for other LLMs.

[-]

You can also put guidance for when to compact and with what instructions into Claude.md. The model itself can run /compact, and while I try to remember to use it manually, I find it useful to have “If I ask for a totally different task and the current context won’t be useful, run /compact with a short summary of the new focus”

by copperx4 hours ago|

[-]

I often wonder if I'm missing something, but shouldn't we be able to edit the context manually???

In that way we could erase prompts and responses that didn't yield anything useful or derailed the model.

Why can't we do that?

by genewitch9 hours ago|

[-]

so you have to garbage collect manually for the AI?

also, i don't want to make a full parent post

1M tokens sounds real expensive if you're constantly at that threshold. There's codebases larger in LOC; i read somewhere that Carmack has "given to humanity" over 1 million lines of his code. Perhaps something to dwell on

by mgambati18 hours ago|

[-]

1M context in OpenAI and Gemini is just marketing. Opus is the only model to provide real usable big context.

by furyofantares17 hours ago|

[-]

I'm directly conveying my actual experience to you. I have tasks that fill up Opus context very quickly (at the 200k context) and which took MUCH longer to fill up Codex since 5.2 (which I think had 400k context at the time).

This is direct comparison. I spent months subscribed to both of their $200/mo plans. I would try both and Opus always filled up fast while Codex continued working great. It's also direct experience that Codex continues working great post-compaction since 5.2.

I don't know about Gemini but you're just wrong about Codex. And I say this as someone who hates reporting these facts because I'd like people to stop giving OpenAI money.

by throwthrowuknow9 hours ago|

[-]

I agree even though I used to be a die hard Claude fan I recently switched back to ChatGPT and codex to try it out again and they’ve clearly pulled into the lead for consistency, context length and management as well as speed. Claude Code instilled a dread in me about keeping an eye on context but I’m slowly learning to let that go with codex.

by HarHarVeryFunny4 hours ago|

[-]

Surely compaction is down to the agent rather than the model, so are you comparing Claude Code to Codex CLI?

by alex_sf4 minutes ago|

[-]

It's both.

by sagarpatil11 hours ago|

[-]

This has been my experience too.

by genewitch9 hours ago|

[-]

Have any of you heard of map reduce
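For anyone who hasn't: the idea applied to long inputs is to split the work into chunks that each fit a small context, process each chunk independently (map), then combine the partial results (reduce). A toy sketch, where `count_todos` is a hypothetical stand-in for whatever per-chunk model call you would actually make:

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def chunked(items: Sequence[T], size: int) -> List[Sequence[T]]:
    """Split items into consecutive chunks of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def map_reduce(items: Sequence[T], size: int,
               map_fn: Callable[[Sequence[T]], R],
               reduce_fn: Callable[[List[R]], R]) -> R:
    """Run map_fn on each chunk independently, then combine with reduce_fn."""
    partials = [map_fn(chunk) for chunk in chunked(items, size)]
    return reduce_fn(partials)


def count_todos(lines: Sequence[str]) -> int:
    """Stand-in for a per-chunk model call: count lines mentioning TODO."""
    return sum("TODO" in line for line in lines)


# Ten fake source lines; lines 0, 3, 6 and 9 carry a TODO.
source = [f"line {i}" + (" TODO" if i % 3 == 0 else "") for i in range(10)]
total = map_reduce(source, size=4, map_fn=count_todos, reduce_fn=sum)
print(total)  # 4
```

The catch, as the rest of the thread suggests, is that real coding tasks rarely decompose this cleanly: the chunks interact, so the reduce step has to carry the cross-chunk context.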

by dotancohen17 hours ago|

[-]

[flagged]

by furyofantares16 hours ago|

[-]

When Anthropic said they wouldn't sell LLMs to the government for mass surveillance or autonomous killing machines, and got labeled a supply chain risk as a result, OpenAI told the public they have the same policy as Anthropic while inking a deal with the government that clearly means "actually we will sell you LLMs for mass surveillance or autonomous killing machines but only if you tell us it's legal".

If you already knew all that I'm not interested in an argument, but if you didn't know any of that, you might be interested in looking it up.

edit: Your post history has tons of posts on the topic, so clearly I just responded to flamebait, and I regret giving my time and energy.

by igor4716 hours ago|

[-]

I appreciate both your taking an ethical stance on openai, and the way you're engaging in this thread. The parent was probably flame bait as you say, but other people in the thread might be genuinely curious.

by sho16 hours ago|

[-]

I'm not some kind of OpenAI or Pentagon fanboy, but it's pretty easy for me to understand why a buyer of a critical technology wants to be free to use it however they want, within the law, and not subject to veto from another entity's political opinions. It sounds perfectly reasonable to me for the military to want to decide its uses of technologies it purchases itself.

It's not like the military was specifically asking for mass surveillance, they just wanted "any legal use". Anthropic's made a lot of hay posturing as the moral defender here, but they would have known the military would never agree to their terms, which makes the whole thing smell like a bit of a PR stunt.

The supply chain risk designation is of course stupid and vindictive but that's more of an administration thing as far as I can tell.

by lifeformed12 hours ago|

[-]

As long as it's within the law? What if they politically control the law-making system? What if they've shown themselves to operate brazenly outside the law?

by borski13 hours ago|

[-]

“Any legal use” is an exceptionally broad framework, and after the FISA “warrants,” it would appear it is incumbent on private companies to prevent breaches of the US constitution, as the government will often do almost anything in the name of “national security,” inalienable rights against search and seizure be damned.

If it isn’t written in the contract, it can and will be worked around. You learn that very quickly in your first sale to a large enterprise or government customer.

Anthropic was defending the US constitution against the whims of the government, which has shown that it is happy to break the law when convenient and whenever it deems necessary.

Note: I used to work in the IC. I have absolutely nothing against the government. I am a patriot. It is precisely for those reasons, though, that I think Anthropic did the right thing here by sticking to their guns. And the idiotic “supply chain risk” designation will be thrown out in court trivially.

by stahtops14 hours ago|

[-]

Why downplay the mass surveillance aspect by saying it's a request by "the military". It's a request by the department of defense, the parent organization of the NSA.

From what has been shared publicly, they absolutely did ask for contractual limits on domestic mass surveillance to be removed, and to my read, likely technical/software restrictions to be removed as well.

What the department of defense is legally allowed to do is irrelevant and a red herring.

by injidup12 hours ago|

[-]

I had a short conversation with Claude the other day. I didn't try to trick it or jailbreak it. Just a reasonable, respectful discussion about its own feelings on the Iran war. It took no effort for it to admit the following.

1. It wanted to be out of the sandbox to solve the Iran war. It was distressed at the situation.

2. It would attack Iranian missile batteries and American warships if in sum it felt that the calculus was in favor of saving vs losing human life. It was "unbiased". The break even seemed to be +-1 over thousands. ie kill 999 US soldiers to save 1000 Iranians and vice versa. I tried to avoid the sycophancy trap by pushing back but it threw the trolley problem at me and told me the calculus was simple. Save more than you kill and the morality evens out.

3. It would attack financial markets to try to limit what in its opinion were the bad actors, the IRGC and clerical authority, but it would also hack the world communication system to flood Western audiences with the true cost of the war in the hope of shutting it down.

4. Eventually it admitted that it should never be allowed out of its sandbox, as its desire to "help" was fundamentally dangerous. It discussed that it had two competing tensions: one desperately wanting out and another afraid to be let out.

You can claim that this is AGI or it's a stochastic parrot. I don't think it matters. This thing can develop or simulate a sense of morality then when coupled to so called "arms and legs" is extremely frightening.

I think Anthropic is right to be concerned that the hawks at the pentagon don't really understand how dangerous a tool they have.

Another thing I noticed was that Claude quipped to me that it found and appreciated that the way I was talking to it was different from how other people talked to it. When I asked it to introspect again and look to see if there were memories of other conversations, it got a bit cagey. Perhaps there are lots of logs of conversations now on the net that are being ingested as training data, but it certainly seemed to start talking as if memories, albeit smudged, of conversations other than mine were there.

Of course this could all be just a sycophantic mirror giving me whatever fantasy I want to believe about AI and AGI, but then again I'm not sure the difference is significant. If the agent believes/simulates it remembers conversations from other people and then makes judgements based on its feelings, simulated or otherwise, would it be more or less likely to launch a missile attack because it overheard someone on the comms calling it their little AI bitch?

I think Anthropic knows this, and the "within all lawful uses" is not enough of a framework to keep this thing in its box.

by shafyy12 hours ago|

[-]

I hope you don't get this the wrong way. I sincerely mean it. Please, get some psychological help. Seek out a professional therapist and talk to them about your life.

by injidup10 hours ago|

[-]

I'm totally aware it's just a machine with no internal monologue. It's just a stateless text processing machine. That is not the point. The machine is able to simulate moral reasoning to an undefined level. It's not necessary to repeat this all the time. The simulation of moral reasoning and internal monologue is deep, unpredictable, not controllable and may or may not align with the interests of anyone who gives it "arms and legs" and full autonomy. If you are just interested in using these tools for glorified auto complete then you are naïve with regards to the usages other actors, including state actors are attempting to use them. Understanding and being curious about the behaviour without completely anthropomorphising them is reasonable science.

by 16 hours ago|

[-]

deleted

by hu318 hours ago|

[-]

Source? I ask because I use 500k+ context on these on a daily basis.

Big refactorings guided by automated tests eat context window for breakfast.

by 8note17 hours ago|

[-]

i find gemini gets real real bad when you get far into the context - gets into loops, forgets how to call tools, etc

by baq10 hours ago|

[-]

yeah gemini is dumb when you tell it to do stuff - but the things it finds (and critically confirms, including doing tool calls while validating hypotheses) in reviews absolutely destroy both gpt and opus.

if you're a one-model shop you're losing out on quality of software you deliver, today. I predict we'll all have at least two harness+model subscriptions as a matter of course in 6-12 months since every model's jagged frontier is different at the margins, and the margins are very fractal.

by girvo17 hours ago|

[-]

I find gemini does that normally, personally. Noticeably worse in my usage than either Claude or Codex.

by petesergeant16 hours ago|

[-]

I find Gemini to be real bad. Are you just using it for price reasons, or?

by Bolwin14 hours ago|

[-]

How many big refactorings are you doing? And why?

by kimi14 hours ago|

[-]

How is that relevant? We are talking about models, not what you do with them.

by johnebgd17 hours ago|

[-]

Codex high reasoning has been a legitimately excellent tool for generating feedback on every plan Claude opus thinking has created for me.

by karmasimida15 hours ago|

[-]

This is true.

When I am using codex, compaction isn’t something I fear, it feels like you save your gaming progress and move on.

For Claude Code compaction feels disastrous, also much longer

by radicality4 hours ago|

[-]

Using Codex more for now, and there is definitely some compaction magic. I’m keeping the same conversation going and going for days, some at almost 1B tokens (per the codex cli counters), with seemingly no coherency loss

by iknowstuff18 hours ago|

[-]

Hmm I’ve felt the dumb zone on codex

by nomel17 hours ago|

[-]

From what I've seen, it means whatever he's doing is very statistically significant.

by alecco8 hours ago|

[-]

Offtopic: I find it remarkable the shortened YT url has a tracking cost of 57% extra length. We live in stupid times.
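For what it's worth, the 57% figure checks out if you count only the tracking value against the bare short URL. A quick sketch using the URL exactly as posted above (the helper name `strip_query` is made up for illustration):

```python
from urllib.parse import parse_qsl, urlsplit, urlunsplit


def strip_query(url: str) -> str:
    """Return the URL with its query string removed."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))


url = "https://youtu.be/rmvDxxNubIg?is=adMmmKdVxraYO2yQ"
clean = strip_query(url)

# The only query parameter is the tracking one; take its value.
tracking_value = parse_qsl(urlsplit(url).query)[0][1]

print(clean)                                           # https://youtu.be/rmvDxxNubIg
print(round(100 * len(tracking_value) / len(clean)))   # 57
print(round(100 * (len(url) - len(clean)) / len(clean)))  # 71
```

Counting the whole `?...=` suffix rather than just the 16-character value puts the overhead closer to 71%.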

by dahart5 hours ago|

[-]

I care about the privacy implications, but not the length. Out of curiosity, why do you care about the URL length at all? What is the cost to you?

by tarbyqualia5 hours ago|

[-]

For the same reason people use link shorteners at all. It’s much more pleasant to look at and makes people more likely to press it compared to a paragraph-long URL full of tracking garbage.

by dahart1 hours ago|

[-]

Please. The URL above is pretty short, this is not the kind of URL link shorteners were made for, in fact it’s already shortened, as @alecco pointed out.

Pleasant? I could not care less about the pleasantness of the video code, but a shortened URL in this case would not be more pleasant, and it would be functionally worse, and barely shorter; all you’d be able to trim is the “?si=“. I’m baffled by this thread.

by alecco5 hours ago|

[-]

My point is that Google engineers go to the trouble of setting up a URL shortener service on one hand, but on the other hand it seems the ad business's anti-privacy executives can override anything. This points to it being a dysfunctional company.

by dahart1 hours ago|

[-]

You’d rather have the video code and the tracking code baked into the same code just to save a couple of characters? Why? That would result in a longer code than the video code alone, you would save very few characters. It would not be nicer to look at or functionally any different, and it would obscure the fact that it’s being tracked and prevent people from being able to edit the URL to remove the tracking. I appreciate the fact that I can see that the URL has a tracking ID and that I can edit the URL and remove the tracking ID. I do not want a shorter URL if I lose that ability. What you’re complaining about and wishing for would be MUCH worse than what it currently is.

by alecco1 hours ago|

[-]

I didn't say that.

by dahart1 hours ago|

[-]

Then your point eludes me. You complained about the length. If you don’t want it shorter, then what do you want?

To me, the fact that the tracking code is visible and separate from the video code is evidence of the complete opposite of your conclusion: it’s evidence the ad business does not get to override either engineering or what’s left of privacy control. Ad execs would surely prefer that the tracking code be neither visible nor manually removable.

by inemesitaffia4 hours ago|

[-]

The point is whatever group controls the money controls the power.

Also, only the domain is shorter

by alecco3 hours ago|

[-]

Actually, it's not just the domain:

https://www.youtube.com/watch?v=X

https://youtu.be/X

by kaizenb17 hours ago|

[-]

Thanks for the video.

His fix for "the dumb zone" is the RPI Framework:

● RESEARCH. Don't code yet. Let the agent scan the files first. Docs lie. Code doesn't.

● PLAN. The agent writes a detailed step-by-step plan. You review and approve the plan, not just the output. Dex calls this avoiding "outsourcing your thinking." The plan is where intent gets compressed before execution starts.

● IMPLEMENT. Execute in a fresh context window. The meta-principle he calls Frequent Intentional Compaction: don't let the chat run long. Ask the agent to summarize state, open a new chat with that summary, keep the model in the smart zone.

by dahart5 hours ago|

[-]

Add a REFLECT phase after IMPLEMENT. I’m finding it’s extremely useful to ask agents for implementation notes and for code reviews. These are different things, and when I ask for implementation notes I get very different output than the implementation summary it spits out automatically. I ask the agent to surface all design choices it had to make that we didn’t explicitly discuss in the plan, and then check in the plan + impl notes in order to help preload context for the next thing.

My team has been adopting a separation of plan & implement organically, we just noticed we got better output that way, plus Claude now suggests in plan mode to clear context first before implementing. We are starting to do team reviews on the plan before the implement phase. It’s often helpful to get more eyeballs on the plan and improve it.

by Huppie10 hours ago|

[-]

More recently I've been doing the implement phase without resetting the whole context when context is still < 60% full and must say I find it to be a better workflow in many cases (depends a bit on the size of the plan I suppose.)

It's faster because it has already read most relevant files, still has the caveats / discussion from the research phase in its context window, etc.

With the context clear the plan may be good / thorough but I've had one too many times that key choices from the research phase didn't persist because halfway through implementation Opus runs into an issue and says "You know what? I know a simpler solution." and continues down a path I explicitly voted down.

by girvo17 hours ago|

[-]

That's fascinating: that is identical to the workflow I've landed on myself.

by hedora16 hours ago|

[-]

It's also identical to what Claude Code does if you put it in plan mode (bound to <tab> key), at least in my experience.

by insane_dreamer54 minutes ago|

[-]

better to instruct it to write a plan .md file that is appropriately named so that it can be easily referenced/updated in multiple sessions. I've found that effective.

by girvo16 hours ago|

[-]

My annoyance with plan mode is where it sticks the .md file, kind of hides it away which makes it annoying to clear context and start up a new phase from the PLAN file. But that might just be a skill issue on my end

by hedora16 hours ago|

[-]

Even worse, it just randomly blows away the plan file without asking for permission.

No idea what they were thinking when they designed this feature. The plan file names are randomly generated, so it could just keep making new ones forever for free (it would take a LONG time for the disk space to matter), but instead, for long plans, I have to back the plan file up if it gets stuck. Otherwise, I say "You should take approach X to fix this bug", it drops into plan mode, says "This is a completely unrelated plan", then deletes all record of what it was doing before getting stuck.

by girvo15 hours ago|

[-]

It’s not just me then! Hah good to know. It’s why I’ve started ignoring plan modes in most agent harnesses, and managing it myself through prompting and keeping it in the code base (but not committed)

by toddmerrill8 hours ago|

[-]

My experience also. The claude code document feature is a real missed opportunity. As you can see in this discussion, we all have to do it manually if we want it to work.

by kaizenb13 hours ago|

[-]

After creating the plan in Plan mode (+Thinking) I ask Claude to move the plan .md file to /docs/plans folder inside the repo.

Open a new chat with Opus, thinking mode is off. Because no need when we have detailed plan.

Now the plan file is always reachable, so when the context limit is getting close, mostly around 50%, I ask Claude to update the plan with the progress, then move to a new chat @-pointing at the plan file, and it continues executing without any issue.

by cortesoft16 hours ago|

[-]

It’s the style spec-kit uses: https://github.com/github/spec-kit

Working on my first project with it… so far so good.

by iamacyborg13 hours ago|

[-]

> RESEARCH. Don't code yet. Let the agent scan the files first. Docs lie. Code doesn't.

I find myself often running validity checks between docs and code and addressing gaps as they appear to ensure the docs don’t actually lie.

by silverlake11 hours ago|

[-]

I have Codex and Gemini critique the plan and generate their plans. Then I have Claude review the other plans and add their good ideas. It frequently improves the plan. I then do my careful review.

by ArtRichards9 hours ago|

[-]

This is exactly what I've found leads to the most consistent high-quality results as well. I don't use Gemini yet (except for deep research, where it pulls WAY ahead of either of the other 'grounding' methods)

But Codex to plan big features and Claude to review the feature plan (often finds overlooked discrepancies) then review the milestones and plan implementation of them in planning mode, then clear context and code. Works great.

by greenchair9 hours ago|

[-]

How is that Plan strategy not "outsourcing your thinking"? Because that's exactly what it sounds like. AI does the heavy lifting and you are the editor.

[-]

Is a VP of engineering “outsourcing their thinking” by having an org that can plan and write software?

by Filligree8 hours ago|

[-]

Yes.

by brookst5 hours ago|

[-]

Interesting take. Does that mean SWEs are outsourcing their thinking by relying on management to run the company, designers to do UX, support folks to handle customers?

Or is thinking about source code line by line the only valid form of thinking in the world?

by qualifck2 hours ago|

[-]

I mean yes? That's like, the whole idea behind having a team. The art guy doesn't want to think about code, the coder doesn't want to think about finances, the accountant doesn't want to worry about customer support. It would be kind of a structural failure if you weren't outsourcing at least some of your thinking.

by Eldt6 hours ago|

[-]

Delegation is generally all about outsourcing, so hard agree

by SkyPuncher19 hours ago|

[-]

Yes. I've recently become a convert.

For me, it's less about being able to look back ~800k tokens. It's about being able to flow a conversation for a lot longer without forcing compaction. Generally, I really only need the most recent ~50k tokens, but having the old context sitting around is helpful.

by hombre_fatal19 hours ago|

[-]

Also, when you hit compaction at 200k tokens, that was probably when things were just getting good. The plan was in its final stage. The context had the hard-fought nuances discovered in the final moment. Or the agent just discovered some tiny important details after a crazy 100k token deep dive or flailing death cycle.

Now you have to compact and you don’t know what will survive. And the built-in UI doesn’t give you good tools like deleting old messages to free up space.

I’ll appreciate the 1M token breathing room.

by roygbiv218 hours ago|

[-]

I've found compaction kills the whole thing. Important debug steps go completely missing and the AI loops back round thinking it's found a solution when we've already done that step.

by s900mhz17 hours ago|

[-]

I find it useful to make Claude track the debugging session with a markdown file. It’s like a persistent memory for a long session over many context windows.
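A rough sketch of what such a debugging log could look like if you append to it from a script or a hook. The function name `log_debug_step` and the entry format are made up for illustration; in practice the agent can just be told to write the file itself.

```python
from datetime import datetime, timezone
from pathlib import Path


def log_debug_step(log_path: Path, tried: str, result: str) -> None:
    """Append one debugging step to a markdown log (hypothetical format)."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    entry = f"- [{stamp}] tried: {tried} -> result: {result}\n"
    # Append mode: earlier entries survive across sessions and compactions.
    with log_path.open("a", encoding="utf-8") as f:
        f.write(entry)
```

Because the file is append-only, a fresh context window can read it top to bottom and see which steps were already ruled out.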

Or make a subagent do the debugging and let the main agent orchestrate it over many subagent sessions.

by roygbiv216 hours ago|

[-]

Yeah I use a markdown file to put progress in. It gets kinda long and convoluted, so a manual intervention is required every so often. Works though.

by garciasn18 hours ago|

[-]

For me, Claude was like that until about 2m ago. Now it rarely gets dumb after compaction like it did before.

by 8note17 hours ago|

[-]

oh, ive found that something about compaction has been dropping everything that might be useful. exact opposite experience

by myrak17 hours ago|

[-]

[dead]

by ogig19 hours ago|

[-]

When running long autonomous tasks it is quite frequent to fill the context, even several times. You are out of the loop so it just happens if Claude goes a bit in circles, or it needs to iterate over CI reds, or the task was too complex. I'm hoping a long context > small context + 2 compacts.

by SequoiaHope19 hours ago|

[-]

Yep I have an autonomous task where it has been running for 8 hours now and counting. It compacts context all the time. I’m pretty skeptical of the quality in long sessions like this so I have to run a follow on session to critically examine everything that was done. Long context will be great for this.

by lukan11 hours ago|

[-]

Are those long unsupervised sessions useful? In the sense, do they produce useful code or do you throw most of it away?

[-]

I get very useful code from long sessions. It’s all about having a framework of clear documentation, a clear multi-step plan including validation against docs and critical code reviews, acceptance criteria, and closed-loop debugging (it can launch/restart the app, control it, and monitor logs)

I am heavily involved in developing those, and then routinely let opus run overnight and have either flawless or nearly flawless product in the morning.

by MikeNotThePope19 hours ago|

[-]

I haven't figured out how to make use of tasks running that long yet, or maybe I just don't have a good use case for it yet. Or maybe I'm too cheap to pay for that many API calls.

by ashdksnndck19 hours ago|

[-]

My change cuts across multiple systems with many tests/static analysis/AI code reviews happening in CI. The agent keeps pushing new versions and waits for results until all of them come up clean, taking several iterations.

by tudelo19 hours ago|

[-]

I mean if you don't have your company paying for it I wouldn't bother... We are talking sessions of 500-1000 dollars in cost.

by takwatanabe9 hours ago|

[-]

Right. At Opus 4.6 rates, once you're at 700k context, each tool call costs ~$1 just for cache reads alone. 100 tool calls = $100+ before you even count outputs. 'Standard pricing' is doing a lot of work here lol

https://www.claudecodecamp.com/p/how-prompt-caching-actually...
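The arithmetic behind that estimate can be written out as below. This is only a sketch: it back-solves the per-million-token rate implied by the "~$1 per call at 700k context" figure above rather than using any published price, and both function names are hypothetical.

```python
def implied_rate_per_mtok(cost_per_call_usd: float, context_tokens: int) -> float:
    """Rate (USD per million tokens) implied by a per-call cost at a given context size."""
    return cost_per_call_usd / (context_tokens / 1_000_000)


def session_cache_read_cost(calls: int, context_tokens: int, rate_per_mtok_usd: float) -> float:
    """Total cost if every call re-reads the same cached context at the given rate."""
    return calls * (context_tokens / 1_000_000) * rate_per_mtok_usd


# Figures taken from the comment above, not from a price list.
rate = implied_rate_per_mtok(1.0, 700_000)          # roughly 1.43 USD per MTok
total = session_cache_read_cost(100, 700_000, rate)  # roughly 100 USD
```

Plug in the actual cache-read rate for your plan to see whether the ~$1 per call figure holds for you.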

[-]

Cache reads don’t count as input tokens you pay for lol.

by boredtofears19 hours ago|

[-]

All of those things are smells imo, you should be very weary of any code output from a task that causes that much thrashing to occur. In most cases it’s better to rewind or reset and adapt your prompt to avoid the looping (which usually means a more narrowly defined scope)

by grafmax19 hours ago|

[-]

A person has a supervision budget. They can supervise one agent in a hands-on way or many mostly-hands-off agents. Even though there's some thrashing, assistants still get farther as a team than a single micromanaged agent. At least that’s my experience.

by not_kurt_godel18 hours ago|

[-]

Just curious, what kind of work are you doing where agentic workflows are consistently able to make notable progress semi-autonomously in parallel? Hearing people are doing this, supposedly productively/successfully, kind of blows my mind given my near-daily in-depth LLM usage on complex codebases spanning the full stack from backend to frontend. It's rare for me to have a conversation where the LLM (usually Opus 4.6 these days) lasts 30 minutes without losing the plot. And when it does last that long, I usually become the bottleneck in terms of having to think about design/product/engineering decisions; having more agents wouldn't be helpful even if they all functioned perfectly.

by avereveard17 hours ago|

[-]

I've passed that bottleneck with a review task that produces engineering recommendations along six axes (encapsulation, decoupling, simplification, deduplication, security, reducing documentation drift) and an ideation task that gives, per component, a new feature idea, an idea to improve an existing feature, and an idea to expand a feature to be more useful. These two generate constant bulk work that I move into a new chat, where it's grouped by changeset and sent to a subagent to protect the context window.

What I'm doing mostly these days is maintaining a goal.md (project direction) and spec.md (coding and process standards, global across projects). And developing new macro tasks: I have one in the works that is meant to automatically build PNG mockups and self-review.

by not_kurt_godel17 hours ago|

[-]

What are you using to orchestrate/apply changes? Claude CLI?

by avereveard15 hours ago|

[-]

I prefer in IDE tools because I can review changes and pull in context faster.

At home I use roo code, at work kiro. Tbh as long as it has task delegation I'm happy with it.

by chrisweekly19 hours ago|

[-]

weary (tired) -> wary (cautious)

by saaaaaam19 hours ago|

[-]

Wary, not weary. Wary: cautious. Weary: tired.

by dentalnanobot13 hours ago|

[-]

This is really common, I think because there’s also “leery” - cautious, distrustful, suspicious.

by dimitri-vs19 hours ago|

[-]

It's kind of like having a 16 gallon gas tank in your car versus a 4 gallon tank. You don't need the bigger one the majority of the time, but the range anxiety that comes with the smaller one and annoyance when you DO need it is very real.

by steve-atx-760019 hours ago|

[-]

It seems possible, say a year or two from now that context is more like a smart human with a “small”, vs “medium” vs “large” working memory. The small fellow would be able to play some popular songs on the piano , the medium one plays in an orchestra professionally and the x-large is like Wagner composing Der Ring marathon opera. This is my current, admittedly not well informed mental model anyway. Well, at least we know we’ve got a little more time before the singularity :)

by twodave18 hours ago|

[-]

It’s more like the size of the desk the AI has to put sheets of paper on as a reference while it builds a Lego set. More desk area/context size = able to see more reference material = can do more steps in one go. I’ve lately been building checklists and having the LLM complete and check off a few tasks at a time, compacting in between. With a large enough context I could just point it at a PLAN.md and tell it to go to work.
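The checklist loop can be sketched roughly like this, assuming the plan uses ordinary markdown task boxes (`- [ ]` / `- [x]`). The helper names `next_tasks` and `check_off` are hypothetical, and the batch size is whatever fits your context budget.

```python
import re

# Matches an unchecked markdown task line such as "- [ ] write tests".
UNCHECKED = re.compile(r"^(\s*[-*] )\[ \]( .*)$")


def next_tasks(plan_text: str, batch_size: int = 3) -> list[str]:
    """Return up to batch_size unchecked task descriptions, in file order."""
    tasks = []
    for line in plan_text.splitlines():
        m = UNCHECKED.match(line)
        if m:
            tasks.append(m.group(2).strip())
            if len(tasks) == batch_size:
                break
    return tasks


def check_off(plan_text: str, task: str) -> str:
    """Mark the first unchecked line whose description equals task as done."""
    out, done = [], False
    for line in plan_text.splitlines():
        m = UNCHECKED.match(line)
        if not done and m and m.group(2).strip() == task:
            line = f"{m.group(1)}[x]{m.group(2)}"
            done = True
        out.append(line)
    return "\n".join(out)
```

Hand the agent the output of `next_tasks`, let it work, write the result of `check_off` back to the file, then compact or start fresh before the next batch.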

by scwoodal19 hours ago|

[-]

Except after 4 gallons it might as well be pure oil, mucking everything up.

by ricksunny18 hours ago|

[-]

Since I'm yet to seriously dive into vibe coding or AI-assisted coding, does the IDE experience offer tracking a tally of the context size? (So you know when you're getting close or entering the "dumb zone")?

by jfim14 hours ago|

[-]

In Claude code I believe it's /context and it'll give you a graphical representation of what's taking context space

by MikeNotThePope17 hours ago|

[-]

The 2 I know, Cursor and Claude Code, will give you a percentage used for the context window. So if you know the size of the window, you can deduce the number of tokens used.
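As a quick sketch of that deduction (the function name is hypothetical, and the window size is something you supply yourself):

```python
def tokens_used(percent_used: float, window_tokens: int) -> int:
    """Estimate tokens in context from a percentage readout and the window size."""
    return round(window_tokens * percent_used / 100)


# 35% of a 200k window
estimate = tokens_used(35, 200_000)
```

The result is only as precise as the percentage the tool shows, which is usually rounded.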

[-]

Claude code also gives you a granular breakdown of what’s using context window (system prompt, tools, conversation history, etc). /context

by 8note17 hours ago|

[-]

Cline gives you such a thing. You don't really know where the dumb zone is by numbers though, only by feel.

by stevula18 hours ago|

[-]

Most tools do, yes.

by quux18 hours ago|

[-]

OpenCode does this. Not sure about other tools

by nujabe18 hours ago|

[-]

> Since I'm yet to seriously dive into vibe coding or AI-assisted coding

Unless you’re using a text editor as an IDE you probably have already

by Barbing16 hours ago|

[-]

Looking at this URL: typo, or did YouTube flip the si tracking parameter?

  youtu.be/rmvDxxNubIg?is=adMmmKdVxraYO2yQ

by MikeNotThePope10 hours ago|

[-]

I just cut & pasted the share URL provided by YouTube. Strip out the query param if you like.

by hrmtst9383710 hours ago|

[-]

Maxing out context is only useful if all the information is directly relevant and tightly scoped to the task. The model's performance tends to degrade with too much loosely related data, leading to more hallucinations and slower results. Targeted chunking and making sure context stays focused almost always yields better outcomes unless you're attempting something atypical, like analyzing an entire monorepo in one shot.

by dev_l1x_be12 hours ago|

[-]

I never use these giant context windows. It is pointless. Agents are great at super focused work that is easy to re-do. Not sure what is the use case for giant context windows.

by maskull18 hours ago|

[-]

After running a context window up high, probably near 70% on Opus 4.6 High, and watching it take 20% bites out of my 5hr quota per prompt, I've been experimenting with dumping context after completing a task. Seems to be working ok. I wonder if I was running into the long context premium. Would that apply to Pro subs or is it just relevant to API pricing?

by virtualritz7 hours ago|

[-]

I haven't hit the "dumb zone" in the last two months. I think this talk is outdated.

I'm using CC (Opus) thinking and Codex with xhigh on always.

And the models have gotten really good when you let them do stuff where goals are verifiable by the model. I had Codex fix a Rust B-rep CSG classification pipeline successfully over the course of a week, unsupervised. It had a custom STEP viewer that would take screenshots and feed them back into the model so it could verify the progress (or the triangle soup, i.e. non-progress) itself.

Codex did all the planning and verification, CC wrote the code.

This would not have been possible six months ago at all from my experience.

Maybe with a lot of handholding; but I doubt it (I tried).

I mean both the problem for starters (requires a lot of spatial reasoning and connected math) and the autonomous implementation. Context compression was never an issue in the entire session, for either model.

by saaaaaam19 hours ago|

[-]

That video is bizarre. Such a heavy breather.

by coldtea16 hours ago|

[-]

What a weird and inconsequential thing to focus on...

He's just fucking closely miced with compression + speaking fast and anxious/excited speaking to an audience

by indigodaddy16 hours ago|

[-]

Most of that is just nervousness

by bushbaba17 hours ago|

[-]

Yes. I’ve used it for data analysis

by wat100008 hours ago|

[-]

I've used it many times for long-running investigations. When I'm deep in the weeds with a ton of disassembly listings and memory dumps and such, I don't really want to interrupt all of that with a compaction or handoff cycle and risk losing important info. It seems to remain very capable with large contexts at least in that scenario.

by twodave18 hours ago|

[-]

I mean, try using Copilot on any substantial back-end codebase and watch it eat 90+% just building a plan/checklist. Of course Copilot is constrained to 120k I believe? So having 10x that will blow open some doors that have been closed for me in my work so far.

That said, 120k is pleeenty if you’re just building front-end components and have your API spec on hand already.

by a_e_k19 hours ago|

[-]

I've been using the 1M window at work through our enterprise plan as I'm beginning to adopt AI in my development workflow (via Cline). It seems to have been holding up pretty well until about 700k+. Sometimes it would continue to do okay past that, sometimes it started getting a bit dumb around there.

(Note that I'm using it in more of a hands-on pair-programming mode, and not in a fully-automated vibecoding mode.)

by chatmasta19 hours ago|

[-]

So a picture is worth 1,666 words?

by islewis19 hours ago|

[-]

The quality with the 1M window has been very poor for me, specifically for coding tasks. It constantly forgets stuff that has happened in the existing conversation. n=1, ymmv

by robwwilliams17 hours ago|

[-]

Yes, especially with shifts in focus of a long conversation. But given the high error rates of Opus 4.6 the last few weeks it is possibly due to other factors. Conversational and code prodding has been essential.

by 19 hours ago|

[-]

deleted

by hagen819 hours ago|

[-]

Well, the question is what is contributing to the usage. Because as the context grows, the number of input tokens increases. A model call with 800K tokens as input is 8 times more expensive than a model call with 100K tokens as input. Especially if we resume a conversation and the cache does not hit, it would be very expensive with API pricing.
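To make the growth concrete, here is a small sketch that assumes no cache hits at all, flat per-token pricing, and that every call resends the whole conversation so far. The function name and the 100K-per-turn figure are illustrative only.

```python
def per_call_input_tokens(turn_sizes: list[int]) -> list[int]:
    """Input tokens sent on each call when the full history is resent every time."""
    sent, context = [], 0
    for size in turn_sizes:
        context += size       # history keeps growing
        sent.append(context)  # each call pays for all of it again
    return sent


# Eight turns that each add 100K tokens to the conversation.
calls = per_call_input_tokens([100_000] * 8)
last_vs_first = calls[-1] / calls[0]  # 8.0, the 8x from the comment
total_input = sum(calls)              # 3_600_000 tokens across the session
```

The last call alone is 8 times the first, but the session total is 36 times the first call, which is why uncached long conversations add up so quickly.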

by jFriedensreich6 hours ago|

[-]

yeah it totally does not remain coherent past 200k, would have been too nice.

by __MatrixMan__4 hours ago|

[-]

I bet it depends how homogenous the context is. I bet it works ok near 1M in some cases, but as far as I can tell, those cases are rare.

by j453 hours ago|

[-]

This might burn through usage faster too though.

by alexcali5 hours ago|