Subscription: opencode go
I also use a claw agent[1] via Telegram, which uses pi.dev under the hood with my opencode go subscription.
[1] I forked one of those Claw projects (bareclaw) and made many changes to it.
---
Funny you mention that, because I started noticing the word 'harness' being used everywhere about a month ago, even though I hadn’t seen it before (in this context). As I don’t trust my memory, I assumed I had just been overlooking it and added it to my vocabulary. However, a Google Trends search does show increased usage since the end of March: https://trends.google.com.br/trends/explore?date=today%203-m...
It's probably just a coincidence. But it would be pretty interesting if we have an example of some kind of memetic phenomenon where one or more popular LLMs make a claim that people then start to repeat as true, or at least follow up on and start writing about, and in so doing the claim becomes true. Even if it didn't happen in this case, I feel like it's only a matter of time.
Highly recommend as a clean way to try out the upstart models.
Not sure where DeepSeek 4 sits
I have about 1 KLOC of harness code, written by Kimi itself, to work around quirks in Kimi that no other model I've tested exhibits, such as infinite tool-call loops and other weirdness.
You can do quite a bit with it and never run into those quirks, or you might hit them on every request.
It is very sensitive to "confusing" things about its environment in a way Sonnet and Opus are not.
Still great value, but they have some way to go.
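A minimal sketch of what one such harness guard might look like, assuming a hypothetical `ToolCallLoopGuard` helper (not the commenter's actual code): detect when the model keeps issuing the same tool call with identical arguments and cut it off after a threshold.

```python
import json
from collections import Counter

class ToolCallLoopGuard:
    """Trips when the same (tool, args) pair repeats too many times."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.seen = Counter()

    def check(self, tool_name: str, args: dict) -> bool:
        """Return True if the call is allowed, False if it looks like a loop."""
        # Canonicalize args so {"a": 1, "b": 2} and {"b": 2, "a": 1} match.
        key = (tool_name, json.dumps(args, sort_keys=True))
        self.seen[key] += 1
        return self.seen[key] <= self.max_repeats

guard = ToolCallLoopGuard(max_repeats=3)
results = [guard.check("read_file", {"path": "main.py"}) for _ in range(5)]
print(results)  # [True, True, True, False, False]
```

A real harness would likely also handle near-duplicate calls and inject a corrective message back to the model instead of just refusing, but the counting trick above is the core idea.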
How do you think the large providers do inference? No single GPU has 1 TB+ of memory on board. It's a cluster of a bunch of GPUs.
GPU interconnect speed is a big bottleneck today for GPUs in AI applications. Data can't move between them fast enough.
The model is fine, I've switched to it entirely for a personal project, but it's not Opus.
And no, you're not running them locally unless you're a millionaire. You still need hundreds of GB (500+) of VRAM - that's not at the level of consumer electronics.
Sure, you can run the quantized models, but then you're at Haiku performance.
Claude becomes near-lobotomized beyond 500,000 tokens. I don't believe much quality code gets produced at such high token counts, not to mention the drastically increased cost.
270k isn't massive, but it's very usable with compaction. Not every task needs the full context history.
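For illustration, a toy sketch of what compaction means here (the function names are invented for this example, and `summarize` stands in for what would really be an LLM summarization call): when the transcript exceeds a token budget, fold the oldest messages into a single summary stub and keep only the recent turns verbatim.

```python
def rough_tokens(text: str) -> int:
    # Crude ~4 chars/token heuristic, good enough for a budget check.
    return len(text) // 4

def summarize(messages: list) -> dict:
    # Placeholder: a real harness would ask the model to write the summary.
    return {"role": "system", "content": f"[summary of {len(messages)} earlier messages]"}

def compact(messages: list, budget_tokens: int, keep_recent: int = 4) -> list:
    total = sum(rough_tokens(m["content"]) for m in messages)
    if total <= budget_tokens or len(messages) <= keep_recent:
        return messages  # fits, or nothing old enough to fold away
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    return [summarize(old)] + recent

history = [{"role": "user", "content": "x" * 400} for _ in range(10)]
compacted = compact(history, budget_tokens=500)
print(len(compacted))  # 5: one summary stub + the 4 most recent messages
```

The trade-off is exactly the one the comment points at: you lose fidelity on old turns, but most tasks only need the recent working set plus a rough recollection of the rest.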
Quantized models do have a quality / accuracy impact, although it is not as drastic as you suggest. There is some good data on this [0].
"These findings confirm that quantization offers large benefits in terms of cost, energy, and performance without sacrificing the integrity of the models. "
One thing worth mentioning is that quantized models are not created equal; they don't all degrade at the same rate. [1] For example, not all tensors contribute equally to model accuracy. In practice, the most sensitive parts (such as key attention projections) are often quantized less aggressively to preserve inference quality.
[0] - https://developers.redhat.com/articles/2024/10/17/we-ran-ove...
[1]- https://medium.com/@paul.ilvez/demystifying-llm-quantization...
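To make the intuition concrete, here is a toy sketch of symmetric int8 quantization (my own illustration, not from either reference): each weight is snapped to one of 256 levels, so the round-trip error is bounded by half a quantization step, which is why the accuracy hit is often modest. Per-tensor scales like this one are also why sensitive tensors can simply be kept at higher precision.

```python
def quantize_int8(weights: list) -> tuple:
    """Symmetric per-tensor int8 quantization: returns (int codes, scale)."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q: list, scale: float) -> list:
    return [v * scale for v in q]

weights = [0.42, -1.3, 0.007, 0.95, -0.61]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Round-trip error is bounded by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))
assert max_err <= scale / 2 + 1e-9
print(q)
```

Real schemes add per-channel scales, zero-points, and calibration data, but the bounded-error picture is the same.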
Check out tensor parallelism
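A toy illustration of the idea (plain Python lists standing in for GPU shards): split the weight matrix row-wise across two "devices", compute each shard's partial output independently, then concatenate. Real systems do this across GPUs, and the concatenation/reduction step is exactly where interconnect bandwidth becomes the bottleneck mentioned above.

```python
def matvec(W: list, x: list) -> list:
    """Dense matrix-vector product: one output element per row of W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Full 4x4 weight matrix and an input vector.
W = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
x = [1, 0, -1, 2]

# Shard by output rows: each "device" holds half the matrix.
dev0, dev1 = W[:2], W[2:]

# Each device computes its partial result; concatenating them
# reproduces the full product (this concat is the cross-GPU traffic).
y_parallel = matvec(dev0, x) + matvec(dev1, x)
assert y_parallel == matvec(W, x)
print(y_parallel)  # [6, 14, 22, 30]
```

Column-wise sharding works too (each device sees part of the input and results are summed); either way, no single device ever needs the whole weight matrix in memory, which is how models far larger than one GPU's VRAM get served.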
Presumptuous and wrong "memories" from a one-off command that affect all future commands; repeated/nonsensical phrases in messages; novel display bugs that make going back in the conversation impossible (I can't tell where I am); lack of basic forking features (resume a current convo in a second CC instance -> fork = no history for that convo?); poor/unclear reasoning; a new set of unclear folksy phrases (it really wants to "cut code" all of a sudden).
Qwen + Opencode has been a game changer: Qwen runs very well on a 4090 for basic/exploratory/private tasks, and being able to switch to and between frontier models (using OpenRouter in my case) to avoid vendor lock-in feels like basic hygiene.
There's also the homo economicus psychological difference between having a token budget to use up, and a cost per token. I'm more thoughtful about my usage now.
So, at least better than GitHub, right? :)
But theirs are much harder to run.
It's bordering on being useless.
You just need to have some idea of what to do when your frontier model is not available. Use Qwen? Read the code you've been generating?
Multi-model coding tools seem like the obvious, sane path forward, but the Will to Lock-in is strong.
Claude Code and Codex are solid, but the real reason people use them over open alternatives is the dramatically lower overall cost.
But it did remind me of how Japanese websites sometimes have opening hours. The website shows a closed status page outside those hours.
Which I think makes some sense for some services for two reasons: your customers build habits and expectations around available service hours, and that in turn gives you regular maintenance windows that can accommodate large impactful changes.
It is one of the reasons a 24-hour public transit network doesn't make complete sense. You shouldn't disrupt a service people have come to rely on, but you can't disrupt a service you never provided in the first place.