upvote
For what it is worth, I’m on a similar machine. (9070XT,5900X) and found a lot of performance improvement over ollama by compiling llama.cpp and running with —no-mmap and —perf. The context is still quite small though. With online models I use contexts of at least 200k which is useful for longer running/more complicated commands.

Locally I haven’t gone much further than 8k. That is sufficient for small changes on small code bases. And you need condensed tool output.

I haven’t tried any tool that compresses the tokens yet.

reply
I would rather we give up the idea of running open models on RTX cards and instead focus on running much bigger open models on H200s.

1. The hardware will eventually catch up.

2. This keeps the delta between frontier models smaller.

3. We can still fine tune and own the weights.

4. The models will be more useful, faster, and reliable.

RTX is hobbyist tier, not professional tier.

Gated cloud models from hyperscalers treat us like hobbyists in their own right.

We need equivalent scale models, but open.

reply
Try llama.cpp it seems to be a lot more performant and a lot more hackable. Also I'm surprised how substantial the impact of some of the inference configs (beyond just temp) can have, though this is much more model specific.
reply
> The best "free" experience I've found is using OpenCode with Big Pickle.

I have absolutely zero interest in free. I honestly don't think I'm even remotely in the same demographic as people using free tiers / models.

I want to pay. I don't want my data used for training. I want it to be open. I want it to be consistently up (more than Claude!). I want it to be fast. I don't want it to be subsidized as that's just an excuse for shitty quality. Deepseek flash knocks it out of the park on all of these except you're data is used in training. I'm fine with it being hosted since there's no way I'm using it 24/7, but data MUST be private.

Basically I want Hetzner and OVH to run open model clouds. I'm convinced this is going to happen eventually when everyone realizes this is a commodity.

reply
You might want to look into Nebius
reply
I'm probably somewhat adjacent to you. I would be happy to pay, but I just don't want to pay any of the companies that are actually offering things right now. I had the $20/month sub for Claude for a couple months, until one day I kept inexplicably getting errors saying I hit the limit even though their site showed my usage at less than half for the session and 8% for the week, and it seemed silly to pay for something that couldn't even properly respect its own measurements. OpenAI sketches me out too much as a company, Cursor feels lackluster when I use it for work from the account they pay for (and now is getting acquired by maybe the only AI company even sketchier than OpenAI), and I wasn't particularly impressed with Gemini or Mistral Vibe either when I tried them on the free tiers either.
reply
I was paying around $500 / month on average between multiple providers for over a year. I cancelled one a while ago because of pretty bad service availability (Bet you guess who that is!), which by all reports hasn't improved much.

For me, paying from $200 - $500 / month is reasonable if I can sustain a disruption free flow that doesn't require constant yak shaving. What I've found experimenting with DeepSeek on some open source library stuff is that it's actually going to cost me much less if I don't need frontier vibing (which I don't).

reply
You can pay, and also use deepseek-v4-flash. OpenRouter even lets you "block" or limit your usage to providers that don't train on data. Since the weights are open, other companies are already serving the model on non-DeepSeek owned hardware: https://openrouter.ai/deepseek/deepseek-v4-flash
reply
Good to know. I hadn't checks since early is DS4's launch when they were the only provide (I think maybe there was one other, but they also trained on your data). I see several private options now.
reply
Hard to guarantee it's private if you don't keep it local... I don't have a lot of trust for companies in this space.
reply
Yes, but I think that'll change eventually. If you trust hosting your code with a specific cloud provider then you'll probably also trust them for code assist. At least that's my theory.

There'll probably need to be a threat of massive litigation should they fail to comply with such a policy.

reply
> If you trust hosting your code with a specific cloud provider then you'll probably also trust them for code assist.

I'm interested in this thought. There is significant motivation for providers to create a verifiable way for them not to deal with having access to client interactions with LLMs at all. Whatever standards and protocols have to be come up with in order to reassure clients.

Any good standards for privacy when interacting with LLMs could also trickle down to smaller providers, and everyone could offer guarantees. Even if the guarantee was literally just an insurance policy and a private court to decide if it pays out.

reply
You can specify which providers you want to serve your model in OpenRouter. Then you can chose US-based ones.
reply
These competent open models you want to use were trained on data from people like you and me.

I wonder if there are competent models trained purely on permissive open-source code like MIT or Apache 2.0.

reply
MIT and Apache 2.0 both require attribution, so it's not like limiting to those would help in license compliance.
reply
I found that, with the heavily quantized Qwen3 models I can cram onto my 3060 Ti, telling the model to use its tools in the system prompt made it a lot more likely to actually do it. YMMV of course, but give it a shot.
reply