Here's how to use the skill on the latest version:
/code-review # do a balanced code review. checks for bugs and inconsistencies, poor code quality, duplication, band aids, etc.
/code-review --fix # same as above, but also fix the issues
# choose an explicit effort level (defaults to your current effort level). all of these also accept --fix:
/code-review low
/code-review medium
/code-review high
/code-review xhigh
/code-review max
# do an expensive and extremely thorough review (reliably catches >99% of bugs, costs $3-20 per review depending on complexity):
/code-review ultra
Open to feedback or ideas for how to make these even nicer to use.
As a casual user working on hobby projects, I struggle to keep up with the pace of changes and to know what to use when. My default now is to use Opus for all coding (Sonnet is fine but seems dumber) and to prompt it for everything I need. I’ve had great success with this but clearly I’m missing power user functions with the slash commands and such.
It's analogous to how in the early days you could see benefits by telling the models to "think step by step". /code-review is something like "review angle by angle": "Consider removed behavior", and also "Look at language gotchas", and also "Look at test changes", etc. Yes, these are all somewhat implicitly already part of what "code review" means, but the models perform best with explicitness.
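To make the "angle by angle" idea concrete, here is a minimal sketch of expanding a bare review request into one explicit pass per angle. The three angles are the ones quoted above; the function name `build_review_prompt` and the prompt wording are assumptions for illustration, not the skill's actual prompt.

```python
# Angles quoted in the comment above. The real skill's full list and
# wording are not shown in this thread, so treat these as placeholders.
ANGLES = [
    "Consider removed behavior",
    "Look at language gotchas",
    "Look at test changes",
]


def build_review_prompt(diff, angles=ANGLES):
    """Expand a bare 'review this' into one explicit pass per angle (hypothetical helper)."""
    passes = "\n".join(f"{i}. {angle}." for i, angle in enumerate(angles, start=1))
    return (
        "Review the following change. Make a separate pass for each angle:\n"
        f"{passes}\n\n"
        f"{diff}"
    )


if __name__ == "__main__":
    print(build_review_prompt("diff --git a/x b/x"))
```

The point is only that each angle becomes its own instruction instead of being left implicit in the words "code review".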
If you want my 2c as a power user: just don't think about it and use /code-review xhigh --fix. This will cover like 98% of what you want out of code review. It's a good skill.
Outsourcing comprehension to a machine is probably gonna cost you more time in the long run.
- Defining the issue/ticket, what "success" looks like (if I have a good idea of this), high level approach guidance 50%
- Dispatch agent to work on it 5%
- Occasionally return and nudge agent + send /simplify or /code-review 5%
- Look at the code/session summary, divergences from the plan, ask followup questions 40%
Occasionally, yes, there is some solution the AI chose that is suboptimal and that I would prefer fixed in a different way. Mostly, though, it's straightforward.
Is there something equivalent when coding in the first place? E.g. /code high “prompt”
https://github.com/anthropics/claude-code/blob/main/plugins/...
This stuff all seems so nebulous to me and I’ve yet to see anything that says use x in y situation. So I default to higher effort levels than I likely need.
/code-review ultra
My main suggestion would be to sound a lot less optimistic about it finding 99% of bugs or being at all thorough, and instead state that it is time-capped and will only find bugs that you explicitly tell it to look for.
I used my three runs of ultrareview.
The first run, with no other prompting, found only a couple of typos in markdown.
For the second one I prompted it with several themes of known open bugs in the code, and it found 6 items.
The third one I ran after doing an actual long audit through Gemini to make a much more detailed prompt about issues in the code.
For that one, instead of doing an exhaustive run, it just never started, so I have no idea if it worked.
But the experience bore no relation at all to the reliability or thoroughness claims.
I find the mix between slash commands that are programmatic harness configuration and control commands (/config, /model, /feedback, /fork, /usage, etc.) and ones that are little more than prompt template insertion (/code-review, /<skill>, etc.) to be a little confusing and unnecessary. A slash command should be one thing, and one thing only: a command for the harness, not the agent.
When I invoke a slash command like /code-review, I should be invoking some additional harness functionality, something above and beyond the agent's sphere of influence - not just pasting some hidden text into the next turn. Otherwise, why wouldn't I just say "Claude, review this code"?
Yet most of these "added value" commands bloating the slash command list are just shortcuts for copy and paste. I don't want to have to learn the syntax of a special /code-review command (which options are positional args, which are --flags, etc.), and I'm much less likely to use or even be aware of a command like this when I can just ask "Do a balanced code review and fix the issues", or use the GUI to set the effort level to xhigh before asking "Review my code." That way I can also be more specific about exactly what I need, rather than relying on what's in the canned prompt - a prompt which I'll probably never read and vet myself anyway. The value added by the slash command needs to be really high compared to just typing a prompt for it to justify the friction of discovery and learning the syntax.
So I suppose I'm advocating for a different system. Keep slash commands for meta-level harness control and configuration, and add a new mechanism for canned prompt insertion, one which is tailor-made for that purpose rather than overloading the slash command system. Let the user see what's in the canned prompts, and even make adjustments or edits as needed before sending them, one-time or persisted. Provide a GUI in the app with the user's favorite prompts, where the user can add, delete, and edit them, making it easy to invoke and insert them as needed. Or let the agent automatically discover and use them as needed, rather than requiring the user to remember and recall their magic shortcuts and their arguments. That's just one idea.
Skills, plugins, commands, and so on need to be consolidated, not just for code review of course, but across the full architecture of how prompt templates are managed.
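A rough sketch of what that separation could look like, purely as an illustration: harness commands stay on `/`, while canned prompts live in a user-editable library and are expanded into visible text before sending. Everything here is hypothetical, including the `PromptLibrary` class, the `route` function, and the `@name` prefix; only the `/config`, `/model`, and `/usage` command names come from the comment above.

```python
from dataclasses import dataclass, field

# Command names taken from the comment above; no real handlers are modeled.
HARNESS_COMMANDS = {"/config", "/model", "/usage"}


@dataclass
class PromptLibrary:
    """User-visible, editable store of canned prompts (hypothetical)."""
    prompts: dict = field(default_factory=dict)

    def add(self, name, text):
        self.prompts[name] = text

    def delete(self, name):
        self.prompts.pop(name, None)

    def render(self, name, extra=""):
        """Return the full text that would be sent, so the user can read or edit it first."""
        base = self.prompts[name]
        return f"{base}\n\n{extra}" if extra else base


def route(user_input, library):
    """Classify input as harness control, an expanded canned prompt, or plain agent text."""
    first, _, rest = user_input.partition(" ")
    if first in HARNESS_COMMANDS:
        return ("harness", user_input)
    # '@name' is an invented prefix for canned prompts, chosen only for this sketch.
    if first.startswith("@") and first[1:] in library.prompts:
        return ("prompt", library.render(first[1:], rest))
    return ("agent", user_input)


if __name__ == "__main__":
    lib = PromptLibrary()
    lib.add("review", "Do a balanced code review.")
    print(route("/model opus", lib))
    print(route("@review focus on tests", lib))
    print(route("review my code", lib))
```

The design choice being illustrated is that the expanded prompt is returned as plain text rather than injected invisibly, so it can be shown, edited, or persisted before it reaches the agent.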
I see now in 2.1.152 you added those focus areas back to /code-review, but still bundled with the correctness finding. It would be great to have more fine-grained control over the /code-review angles beyond just effort level. Or maybe you would recommend that I just specify that as freeform input after the effort level?
In what scope?
The subagent approach is structurally different from the others because it runs with clean context. That has three major effects:
1. All other things being equal, it will result in a lower cost-to-solution, because the cost of an LLM session scales quadratically with its length (the input-token or cached-input cost of the whole context is paid again with each new round).
2. The review model will not be able to 'cheat' by retaining assumptions from the main session, such as "x must be done like y." For people, this is why having a separate person perform code review (or, if not possible, reviewing code after a mind-clearing break) is handy; the applicability of this analogy to LLMs is vague but reasonable.
3. The main model will only see the results of the review, not the detailed reasoning that leads up to it. On one hand this avoids more context pollution, but on the other hand it might lead to duplicative logic to re-discover the mechanics behind bugs found.
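The first point can be checked with back-of-the-envelope arithmetic. The sketch below assumes a simplified billing model in which every round re-sends the entire history as input; it ignores output tokens and caching discounts (cached input is cheaper per token but still grows with context length). All token counts are made-up round numbers, not measurements.

```python
def session_input_tokens(turn_sizes):
    """Total input tokens billed when each round re-sends the whole history."""
    total = 0
    context = 0
    for size in turn_sizes:
        context += size   # the history grows by this round's new tokens
        total += context  # the whole history is billed as input again
    return total


# Hypothetical main session: 40 rounds of 2,000 new tokens each (80,000-token context).
main_session = [2_000] * 40

# Option A: run a 10-round review inside the main session, on top of that context.
review_in_main = (
    session_input_tokens(main_session + [2_000] * 10)
    - session_input_tokens(main_session)
)

# Option B: run the same review in a subagent that starts from a 10,000-token brief.
review_in_subagent = session_input_tokens([10_000] + [2_000] * 9)

if __name__ == "__main__":
    print(f"review inside main session: {review_in_main:,} input tokens")
    print(f"review in clean subagent:   {review_in_subagent:,} input tokens")
```

Under these assumptions the in-session review bills 910,000 input tokens and the clean-context review 190,000, because the subagent does not drag the main session's 80,000 tokens through every round.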
> I checked the session logs to see how often the agents were actually invoking the LSP tools. The answer was they had invoked them literally once the entire time.
I think the intent behind 'install a language server plugin' is that these tools should lint automatically after every edit, without waiting for an explicit call from the LLM.
Yes, and this is what I mean by "which context the prompt runs in". The subagent approach is different and has pros and cons, and it may in some situations be better (but perhaps not in others). On the other hand, I can also just create a new conversation and paste my own review prompt into it; then take the last turn's summary output and feed it back into my main conversation thread in the unusual event I would need to do so. Spawning a subagent is a convenient shortcut for this, but ultimately, it's the same thing.
> I think the intent behind 'install a language server plugin' is that these tools should lint automatically after every edit, without waiting for an explicit call from the LLM.
This is a great point and I had only checked my session logs for explicit tool calls. I went back and looked for diagnostics injected automatically by the harness after every edit, and whether the agent made use of them.
Claude: neither the Rust nor the Dart LSP ever inserted any diagnostic events, but ty did. Across 627 sessions, ty-lsp injected diagnostic blocks in 186 sessions, with a total of 33 findings. Of those 33, 32 were dismissed as unrelated (13) or pre-existing (19), and only 1 was acted upon. The model is in the habit of running the batch analysis tools (ruff, ty, cargo clippy, etc.) and prek anyway, so it would have caught that diagnostic regardless.
Codex: no diagnostic events were inserted by any of the LSPs.
So I won't be reinstalling those LSPs.
When I need code review I should just say “review it”. The model should figure out what plugins, skills, etc. to use.
I’m not aware of anything fundamentally unique about skills or commands; they’re just more tokens to shape the LLM.
Yes, yes, thank you, sometimes I feel like I'm taking crazy pills.
The industry and overall developer ecosystem have become absolutely mesmerized by the act of creating and popularizing little bits of protocol and machinery to dress up the act of inserting text into the machine. Yes, they're useful and provide some consistency, but I'm convinced that the main reason people like them so much is that they put a thin "I'm still a programmer wielding complicated tools that laypeople don't understand" coating over the fact that we're all just asking the AI nicely to do a thing.