I’m just pleased by the competition. I agree with the ideal of free and local, but sustainable competition is key: driving $200/month down to a much, much lower number.
If they release a Qwen 3.6 that also makes good use of the card, I may move to it.
I intend to evaluate all four on some evals I have, as long as I don't get squirrelled again:
- Implement a numerically stable backward pass for layer normalization from scratch in NumPy.
- Design and implement a high-performance fused softmax + top-k kernel in CUDA (or CUDA-like pseudocode).
- Implement an efficient KV-cache system for autoregressive transformer inference from scratch.
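For context on the first task: layer norm has a well-known closed form for the backward pass, and the usual "numerically stable" trick is to differentiate through the normalized activations rather than expanding the quotient rule term by term. A minimal NumPy sketch (function names, shapes, and the `eps` default are my own choices, not from any of the models' answers):

```python
import numpy as np

def layernorm_forward(x, gamma, beta, eps=1e-5):
    # x: (batch, features); normalize over the last axis
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    inv_std = 1.0 / np.sqrt(var + eps)
    xhat = (x - mu) * inv_std
    y = gamma * xhat + beta
    cache = (xhat, inv_std, gamma)
    return y, cache

def layernorm_backward(dy, cache):
    xhat, inv_std, gamma = cache
    # Parameter grads sum over the batch axis
    dgamma = (dy * xhat).sum(axis=0)
    dbeta = dy.sum(axis=0)
    # Stable form: subtract the two per-row means of the upstream grad
    # instead of materializing d(var)/dx and d(mu)/dx separately
    dxhat = dy * gamma
    dx = inv_std * (dxhat
                    - dxhat.mean(axis=-1, keepdims=True)
                    - xhat * (dxhat * xhat).mean(axis=-1, keepdims=True))
    return dx, dgamma, dbeta
```

A finite-difference check against a random linear loss is the easiest way to grade model answers to this task automatically.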
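And for the KV-cache task, the core idea is small enough to sketch: preallocate key/value buffers up to a max sequence length, append the new tokens' K/V each decode step, and hand attention a view over the valid prefix so nothing is recomputed or copied. A bare-bones single-layer version (shapes and names are illustrative, not from any model's answer):

```python
import numpy as np

class KVCache:
    """Preallocated key/value cache for autoregressive decoding.
    Layout: (batch, heads, max_seq, head_dim)."""

    def __init__(self, batch, heads, max_seq, head_dim, dtype=np.float32):
        self.k = np.zeros((batch, heads, max_seq, head_dim), dtype=dtype)
        self.v = np.zeros((batch, heads, max_seq, head_dim), dtype=dtype)
        self.max_seq = max_seq
        self.len = 0  # number of valid cached positions

    def append(self, k_new, v_new):
        # k_new / v_new: (batch, heads, t, head_dim) for t new tokens
        t = k_new.shape[2]
        if self.len + t > self.max_seq:
            raise ValueError("KV cache overflow")
        self.k[:, :, self.len:self.len + t] = k_new
        self.v[:, :, self.len:self.len + t] = v_new
        self.len += t
        # Return views over the valid prefix; attention reads these
        # each step without any reallocation or copy
        return self.k[:, :, :self.len], self.v[:, :, :self.len]
```

In a real implementation you'd keep one of these per layer and evict or roll the buffer for contexts past `max_seq`, but this is the part most answers get wrong (reallocating/concatenating every step).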
and tested Qwen3.6-27B (IQ4_NL on a 3090) against MiniMax-M2.7 and GLM-5, with kimi k2.6 as the judge (imperfect, I know, it was 2AM). Qwen surpassed MiniMax and won 2/3 of the implementations against GLM-5 according to kimi k2.6, which still sounds insane to me. The env was a pi-mono with basic tools + a websearch tool pointing to my searxng (I don't think any of the models used it), with a slightly customized, shorter system prompt. TurboQuant was at 4-bit during all Qwen tests. Full results: https://github.com/sleepyeldrazi/llm_programming_tests.
I'm also periodically testing small models on a https://www.whichai.dev -style task to see their designs, and Qwen3.6 27B also obliterated (imo) the others I tested: https://github.com/sleepyeldrazi/llm-design-showcase .
Needless to say, these tests are non-exhaustive and have flaws, but the trend from the official benchmarks seems to be confirmed in my testing. If only it were a little faster on my 3090; we'll see how it performs once a DFlash for it drops.
What context size are you using for that?
Btw, are you using flash attention in Ollama for this model? I think it's required for this model to operate ok.
-- Q5_K_M Unsloth quantization on Linux llama.cpp
-- context 81k, flash attention on, 8-bit K/V caches
-- pp (prompt processing) 625 t/s, tg (token generation) 30 t/s
Q8 with the same context wouldn't fit in 48GB of VRAM, while the Q5_K_M did even with 128k of context.
You’re much better off adding a second GPU if you’ve already got a PC you’re using.