This is not my experience at all. Even the Nous Research guys have stated that "Qwen3.6-27B is the canonical local model to use Hermes Agent with" [https://old.reddit.com/r/LocalLLaMA/comments/1sz2y76/ama_wit...]. I'm finding the same when using it with Pi and OpenCode.

Gemma will just stop mid-tool call. It's been slower and I've had to reduce context size to run it. Qwen3.6 27b has been rock solid using club 3090's single card setup for agentic use -- https://github.com/noonghunna/club-3090/blob/master/docs/SIN...

by adam_arthur 2 hours ago

I'm talking about automation generally, not agent loops.

E.g. prompt A to achieve X, output in format Y. Use Y to do something in prompt B.

Agentic loops will underperform deterministic control flow pipelines (with non-determinism constrained to LLM calls).

Agents are more general, which is the main advantage. But inherently a more general solution will waste context on unnecessary reasoning.
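A minimal sketch of the kind of pipeline described above, assuming a hypothetical `llm` callable that takes a prompt and returns text. The ticket-summary task, the prompt wording, and the key names are illustrative only and are not from the thread. The point is that the sequencing lives in ordinary code, and each LLM call is a single step whose output is validated before the next step runs.

```python
import json
from typing import Callable

# Hypothetical: any function that sends a prompt to a model and returns its text.
LLM = Callable[[str], str]

REQUIRED_KEYS = {"title", "priority"}  # illustrative schema for "format Y"


def step_a(llm: LLM, ticket: str) -> dict:
    """Prompt A: turn free text into a JSON object (format Y)."""
    raw = llm(
        "Summarize this ticket as JSON with keys 'title' and 'priority' "
        "(one of low/medium/high). Output JSON only.\n\n" + ticket
    )
    data = json.loads(raw)  # raises a ValueError subclass on malformed output
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        raise ValueError(f"unexpected shape: {data!r}")
    return data


def step_b(llm: LLM, y: dict) -> str:
    """Prompt B: consume the validated Y."""
    return llm(f"Write a one-line status update for: {y['title']} ({y['priority']})")


def pipeline(llm: LLM, ticket: str) -> str:
    # Control flow is plain code; the only non-determinism is inside llm().
    return step_b(llm, step_a(llm, ticket))
```

Because `llm` is injected, the same pipeline can be pointed at any local server, or at a stub when testing the control flow itself.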

Try asking the smaller Qwen models to output JSON in a specific format. They basically can't do it consistently with a moderately sized prompt unless you constrain token generation with a grammar (e.g. GBNF in llama.cpp) or are extremely repetitive and specific about it. (Thinking disabled)

Gemma 4 will do it correctly pretty much 100% of the time. (Thinking disabled)
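If you want to check a claim like this on your own prompts, one rough approach is to collect raw outputs from repeated runs and count how many parse into the expected shape. The sketch below assumes the format is a flat JSON object with a fixed set of keys and simple types; the function names and the example schema are hypothetical.

```python
import json


def matches_format(raw: str, required: dict) -> bool:
    """True if raw is a JSON object with exactly the required keys and types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False  # extra prose, truncated output, etc.
    if not isinstance(data, dict) or set(data) != set(required):
        return False
    return all(isinstance(data[key], typ) for key, typ in required.items())


def success_rate(outputs: list, required: dict) -> float:
    """Fraction of raw outputs that match the format (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    return sum(matches_format(o, required) for o in outputs) / len(outputs)


# Illustrative schema: {"name": <string>, "tags": <list>}
schema = {"name": str, "tags": list}
```

Running the same prompt many times against each model with thinking disabled and comparing `success_rate` gives a number to argue about instead of an impression.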

Applies to other rule following as well in my experience.

Qwen may be better at tool calling, and is almost certainly better at codegen.

It seems to me Google explicitly designed Gemma for edge device automation, and didn't fine tune for agentic or coding use cases.

by ozim 11 minutes ago

I was expecting DGX Spark to run Gemma 31b Q4 much faster.

I was expecting it would run Q8 at 50 tok/s.

I guess it's good I stopped thinking about buying it, because I would have been disappointed.

by trouve_search 2 hours ago

On a 5090, gemma4 26B runs at 350TPS with the command below [1] and gemma4 31B is around 150TPS with a similar command.

I'm really surprised how much slower a DGX Spark is for the same price.

1. Here's my command.

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
vllm serve cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit \
  --dtype auto \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --reasoning-parser gemma4 \
  --max-num-batched-tokens 16000 \
  --max-model-len 64000 \
  --max-num-seqs 12 \
  --speculative-config '{"model": "./gemma-4-26B-A4B-it-assistant", "num_speculative_tokens": 4}'

by adam_arthur 2 hours ago

Yes, I'd recommend a 5090 over the DGX Spark if your goal is general automation.

You can run multiple instances of these models in parallel on the DGX Spark, which somewhat mitigates the difference if your task is parallelizable.
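As a rough illustration of the parallel case, here is a sketch that spreads independent tasks across several model instances round-robin. The `send` callable and the endpoint strings are hypothetical placeholders for however you reach each instance; nothing here is specific to the DGX Spark or to any particular server.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def fan_out(
    tasks: Sequence[str],
    endpoints: Sequence[str],
    send: Callable[[str, str], str],
) -> list:
    """Run send(endpoint, task) for every task, assigning endpoints round-robin.

    Results are returned in the same order as tasks.
    """
    if not endpoints:
        raise ValueError("need at least one endpoint")
    assigned = [endpoints[i % len(endpoints)] for i in range(len(tasks))]
    # At most len(endpoints) requests are in flight overall at any time.
    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        return list(pool.map(send, assigned, tasks))
```

This only helps when the tasks do not depend on each other. A chain where prompt B needs prompt A's output still runs at single-instance speed.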

But I'd take the simplicity of a single thread and higher throughput personally.

Overall, of course, it's still better to wait for next-gen devices if you can.

by msp261 hours ago|

Yep agreed completely. I couldn't imagine torturing myself with a small model for local coding. But Gemma 4 31B is so fucking good for a variety of language modelling tasks.

by gopher_space 2 hours ago

In my mind it’s a question of knowing what you want to build and how to divide the project into tasks your local setup can handle.

If you don’t need the machine to respond instantly (or explain your own business model to you) everything can be local and it’s been like that for a few years now.