
by elAhmo 1 hour ago

by spacebacon 1 hour ago

They are doomed. Publishing small wins while they can.

https://open.substack.com/pub/sublius/p/srt-introspect-why-c...

by Wowfunhappy 44 minutes ago

Something I found helpful: In this article, scroll down to the first big image, which is a graph labeled “Agentic coding performance by effort level”. https://www.anthropic.com/engineering/april-23-postmortem

This convinced me to just always set 4.7 to xhigh. Admittedly, I'm not sure about 4.8.

by thaanpaa 48 minutes ago

It probably limits the number of intermediate tokens one way or the other. The impact on the result is almost certainly close to zero.

by kkukshtel 57 minutes ago

Not only this, but hermetic checks on local machines for spot-testing new models are becoming increasingly difficult, if not impossible:

- We have zero visibility into what Anthropic does with our own prompts server-side. Do they return cached results for similar queries? Do our own prompts develop hot paths?

- Local memory files are written independently of the project directory and are acted on by new models, even if older models wrote them.

- CLAUDE.md files vary in efficiency, and different models (and effort levels) treat them differently.

- Our own git history "supports" newer models. If you have a larger body of work in git when you adopt a new model (like 4.8) than you had when you started from scratch with 4.6 or something, 4.8 may "appear" smarter when in fact you just have more evidence and signal about what you intend the model to do.