I have a similar setup. It might be worth checking out pi-coding-agent [0].

The system prompt and tools have very little overhead (<2k tokens), making the prefill latency feel noticeably snappier compared to Opencode.

[0] https://www.npmjs.com/package/@mariozechner/pi-coding-agent#...

reply
Just set up Pi after listening to Mario's talk at AIE Europe [0], and my initial impressions are solid! Especially on limited hardware like a MacBook Air, it seems a lot more resource-efficient.

[0] https://www.youtube.com/live/_zdroS0Hc74?t=3633s

reply
Pi is _really_ good for personal stuff, but since it lacks every safety mechanism imaginable, it's not really something one can deploy in a corporate environment :D
reply
Thanks! I just ran a quick test with Pi, and it does run a bit faster.
reply
I run this model on my AMD RX 7900 XTX with 24GB VRAM, with up to 4 concurrent chats and a 512K context window in total. It is very fast (~100 t/s), feels instant, and is very capable; I've been using Claude Code less and less these days.
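For anyone wanting to reproduce a similar multi-chat setup, here's a minimal sketch assuming llama.cpp's llama-server: `-c` sets the *total* context and `-np` the number of parallel slots, and the context is divided evenly across slots. The model filename is hypothetical.

```shell
# Minimal sketch, assuming llama.cpp's llama-server.
# -c is the total context; each of the -np parallel slots gets an even share.
TOTAL_CTX=524288   # 512K tokens across all slots
SLOTS=4            # concurrent chats
echo "per-slot context: $((TOTAL_CTX / SLOTS)) tokens"
# Model filename below is hypothetical -- substitute your own GGUF:
# llama-server -m ./model.gguf -c "$TOTAL_CTX" -np "$SLOTS" --host 127.0.0.1 --port 8080
```

With these numbers each chat gets a 131072-token window, which is why a single 24GB card can serve four sessions at once.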
reply
I do the same thing on a MacBook Pro with an M4 Max and 64GB. I had problems until the most recent LM Studio update (0.4.11+1): tool calling didn't work correctly.

Now both codex and opencode seem to work.

reply
Which do you prefer? And what lmstudio api works best for these tools?
reply
I use the OpenAI-compatible API for everything. I think codex is more polished, but I don't really have a preference: I haven't used either enough. I mostly use Claude Code.
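For context, tools like codex and opencode just need a base-URL override to talk to LM Studio's OpenAI-compatible endpoint. A minimal sketch of the request shape (port 1234 is LM Studio's default; the model id below is hypothetical):

```python
import json

# Sketch of the OpenAI-style chat request that local tools send to LM Studio.
# Assumptions: LM Studio's default server port (1234); model id is whatever
# model you have loaded -- the one used below is hypothetical.
def build_chat_request(model, messages, base_url="http://localhost:1234/v1"):
    """Return the endpoint URL and JSON body for a chat completion call."""
    url = f"{base_url}/chat/completions"
    body = json.dumps({"model": model, "messages": messages, "stream": False})
    return url, body

url, body = build_chat_request(
    "qwen3-coder-30b",  # hypothetical model id
    [{"role": "user", "content": "Write a hello-world in Swift."}],
)
print(url)  # http://localhost:1234/v1/chat/completions
```

Pointing codex or opencode at that base URL is usually all the configuration required.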
reply
I did the same using the MLX version on an M1 MacBook, with LM Studio integrated into Xcode. I had to bump the context size. I ran it against a very modest iOS codebase and it didn't do well; it just petered out at one point. Odd. It's a pretty good chatbot, and maybe it'll do better against other code, but it wasn't useful with Xcode for me.
reply
I spun up a GPU on Runpod and tried the 31b full res, and it was really impressive. I'm now using it via the Google API, which gives you 1,500 requests a day for free, IIRC.
reply
Be very careful about using Google's APIs as a consumer; they have poor rate limiting and ineffective anomaly protection.

I (a hobbyist running a small side project for a dollar or two a month in normal usage, so my account is marked as "individual") got hit with a ~$17,000 bill from Google Cloud because either a key got leaked or my homelab got compromised, and the attacker consumed tens of thousands of dollars in Gemini usage in only a few hours. It wasn't even the same Google project as my side project; it was another one that hadn't seen activity in a year-plus.

Google refuses to apply any adjustments; their billing specialist even mixed up my account with someone else's, refuses to explain why the adjustments are being rejected, refuses any escalation, etc. I already filed complaints with the FTC and the NYS Attorney General, but the rep couldn't care less.

My gripe is not that the key was potentially leaked or compromised and I have to pay for a very expensive "you messed up" mistake; it's that they let an API key rack up tens of thousands of dollars in maybe four hours with usage patterns (model selection, text vs. image generation, volume of calls, likely a different IP and user agent, and so on) completely unlike the account's history. That's just predatory behavior toward an account marked individual/consumer (not a business).

reply
Totally agree. I'm super paranoid and anxious about this issue; I've seen too many horror stories posted on Reddit. I did set alerts at $10 a day on the account, but those are only alerts, and I could be thousands of dollars over before I see them.

I think Google finally implemented hard limits this month and I need to go find that setting, but it's useless if, like you say, their rate limiting and metering are so sloppy that you're way over the limit before they stop you.
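For what it's worth, the alert side can at least be scripted with gcloud. A sketch only (the billing-account ID is a placeholder, and budgets are monthly, so a $10/day ceiling is approximated as $300/month); note that a budget by itself only *notifies*, it doesn't cap spend:

```shell
# Sketch: create a budget with alerts at 50% and 100% of a $300/month cap.
# Billing-account ID below is a placeholder.
gcloud billing budgets create \
  --billing-account=XXXXXX-XXXXXX-XXXXXX \
  --display-name="gemini-spend-guard" \
  --budget-amount=300USD \
  --threshold-rule=percent=0.5 \
  --threshold-rule=percent=1.0
# NB: budgets only send notifications; actually stopping spend still requires
# wiring the budget's Pub/Sub topic to something that disables billing on the
# project.
```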

reply
Not sure if you've already tried them, but both GLM Flash and the Qwen models are much better than Gemma for that in my experience.

I am using a 24GB GPU so it might be different in your case, but I doubt it.

reply
GGUF or MLX? Edit: just tried a community MLX build and LM Studio said it doesn't support loading it yet.
reply