upvote
Prompt matters. Obviously if you want another model opinion you must generate from the scratch using the same prompt and then you can try to synthesize, but working with an existing response can work if desired. I use explicit instructions to find issues with assigned severities and then these are going through the panel of judges, only issues passing certain threshold are fixed in the original response.

I'll share a revelation which vastly improved my results: tell judges to evaluate truth and usefulness/should-be-fixed axis separately. Because inevitably with a prompt that is forcing to find issues you will end up with nitpicks. Plus truth axis allows to better evaluate the issue-finder models for your use case.

That's some part of what happens when I generate explanations like this one: https://hanzirama.com/character/%E6%9D%A5#explain - at this point the site is a small side product of my LLMs-evaluation machinery.

Bonus content for patient readers: if you need top quality you will likely need to pin provider(s) on OR, :exacto is not enough to get good repeatable results especially for open-weights models.

reply
I made a rough version of this in 2024[0], interesting to see that the idea is still around. I had the ability to set "quality thresholds", but it didn't seem to matter, the frontier models pretty much always agreed with each other and scored the answer highly, I should revisit it since it is a whole different ballgame than it was 2 years ago.

[0] https://github.com/Ceroxylon/konsensis

reply
I think it depends on whether the answer is verifiable.

I have tested two judge models in my apps:

1. Judge model for a resume tailor. It evaluated the result resume vs the base resume and JD and judged it out of 10 on fit and honesty. It worked well and was useful.

2. Review model in my LLM trading bot platform. It reviews decisions from the Main model. The problem here is that the bot is navigating ambiguity. So unless the Review model catches an outright blunder (e.g. making a decision on wrong candle price or a BUY when it should be a SELL), the Review model can do more harm than good.

First, it adds latency to decisions, decisions take twice the amount of time (like be 60s instead of 30s for Gemma 4 31B). Second, it can make the bot too cautious, because Review model only runs on BUY/SELL decisions and not HOLD decisions, so the bot will only make less trades instead of review model increasing number of trades (because of latency and cost).

So overall, I think you'll get better results with a better model single shotting it rather than a review model if the answer isn't easily verifiable. But then why do you need a judge model and not just have the same agent review itself?

ALSO, if you read the reasoning text for a reasoning model (like Gemma 4), you see that it ALREADY reviews itself. So it's doing its best, re-review isn't really adding information. It's an interesting experiment, but you need to evaluate on a case by case basis.

reply
I've found that if I tell a judge that the answer came from a small and weak local LLM, it will pick the answer apart brutally...but since I have not done this systematically, I dont know how well it generalizes past my vibes.

Anyone else fell like if you can trick the LLM into a mode where it "feels" superior, it will act the asshole very well?

reply
Yeah. I usually do this by telling it to be adversarial and find gaps and holes. Not fool proof but it does seem to increase the quality. It has helped when using local models in particular.
reply
Yeah, you have to shortcut the RL-trained people pleasing
reply
Yes, definitely not a new idea. I had a multi-turn composite model in 2024 that was outperforming the top models across benchmarks: https://x.com/LechMazur/status/1828804485033992514.
reply
Yeah, same experience. It turned out that objectively better answers were not that easy to find plus the expense plus it’s slow.
reply
Nice - I built an npm package in a similar fashion called Agent Order: https://github.com/btahir/agent-order

I think there is alpha just have to be very careful how you let the models com up with solutions and collaborate.

reply
I've started to have different models review things like architectural planning docs- and I think for these more "fuzzy" outputs the differences between the outputs can be quite different and I can use my own "taste" to pick the best one.

I don't think it would work without a human in the loop but it is surprising to me how varied models' vibes are and how a system design varies by what it thinks is important to include and emphasize.

reply
I’d be interested in the benchmarking if you ever write it up! People do seem to assume LLM as a judge/panel improves outcomes (and arguably it does in cases like code review?) but I suspect it is very situational and the priors from human panel of experts don’t always translate cleanly.
reply
I had a very similar experience. I'd be keen to see how you went about it if you release it.

Here's what I use: https://github.com/DheerG/swarms

reply
I've been thinking along those lines, too. Could you give a general overview of your solution?
reply
I think it depends.

I regularly ask both GPT and Gemini to give me options - programming libraries to do X, architecture suggestions, names for projects/services/classes

After they answer I ask each model what does it think of the other answer, and to give me a final suggestion considering both answers.

Both GPT and Gemini would frequently say "that other answer is much better than my one, it considered X factor that I missed".

reply
Try telling it the answer came from a small local LLM..the condescension can become palpable.
reply
But.. but I told the LLM that it is an _expert_, is that worth nothing??
reply
Make sure to remind it to make no mistakes.
reply
You found the smoking gun!
reply
prompting "no mistakes" was load-bearing
reply
[dead]
reply
[dead]
reply