- Very context efficient: SWA by default; on a 128 GB Mac I can run the full 256k context or two 128k context streams.
- Good speeds on Macs. On my M1 Ultra I get 36 t/s tg and 300 t/s pp, and these speeds degrade very slowly as context grows: at 100k prefill it still does 20 t/s tg and 129 t/s pp (see the quick arithmetic after this list).
- Trained for agentic coding. I think it is trained to be compatible with Claude Code, but it works fine with other CLI harnesses except Codex (whose patch edit tool can confuse it).
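To put those 100k-context numbers in perspective, here is the arithmetic (speeds taken from the bullets above):

```python
# How long a full 100k-token prefill takes at the reported
# prompt-processing speed of 129 t/s at that context depth.
prefill_tokens = 100_000
pp_tokens_per_sec = 129
print(f"{prefill_tokens / pp_tokens_per_sec / 60:.1f} minutes")  # ~12.9 minutes
```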
This is the first local LLM in the 200B parameter range that I find usable with a CLI harness. I've been using it a lot with pi.dev, and it has been the best experience I've had with a local LLM doing agentic coding.
There are a few drawbacks though:
- It can generate some very long reasoning chains.
- The current release has a bug where it sometimes goes into an infinite reasoning loop: https://github.com/ggml-org/llama.cpp/pull/19283#issuecommen...
Hopefully StepFun will do a new release which addresses these issues.
BTW, StepFun seems to be the same company that released ACEStep (a very good music generation model). At least StepFun is mentioned in the ComfyUI docs: https://docs.comfy.org/tutorials/audio/ace-step/ace-step-v1
I had tried Nemotron 3 Nano with OpenCode, and while it kind of worked, its tool use was seriously lacking because it just leans on the shell tool for most things. For example, instead of using a dedicated tool to edit a file, it would just run sed on it through the shell tool.
That's the primary issue I've noticed with agentic open-weight models in my limited testing: they seem hesitant to call tools they don't recognize unless explicitly instructed to do so.
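As an illustration of what "explicitly instructed" can look like, here is a sketch in OpenAI function-calling format; the `edit_file` tool name and its schema are made up for this example, not taken from any of the harnesses mentioned:

```python
# Hypothetical tool definition plus a system-prompt nudge that steers
# the model toward the dedicated edit tool instead of shelling out to sed.
tools = [{
    "type": "function",
    "function": {
        "name": "edit_file",  # made-up name for this sketch
        "description": "Replace a text span in a file. Prefer this over running sed via the shell.",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "old_text": {"type": "string"},
                "new_text": {"type": "string"},
            },
            "required": ["path", "old_text", "new_text"],
        },
    },
}]

system_prompt = "For file edits, always call edit_file; never run sed or awk through the shell tool."
```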
Did anyone do this kind of math?
However, if you check the prices of the Chinese models (which are the only ones you would be able to run on a Mac anyway), they are much cheaper than the US plans. It would take you forever to get to the $10k.
And of course this is not even considering the energy costs of running inference on your own hardware (though Macs should be quite efficient there).
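For a rough sense of scale, here is a back-of-envelope calculation; both numbers are purely illustrative, not taken from the thread:

```python
# How many tokens you would have to buy from a cheap API before
# matching a hypothetical $10k hardware outlay.
hardware_cost_usd = 10_000
api_usd_per_million_tokens = 0.50  # made-up rate for a cheap Chinese model

break_even_tokens = hardware_cost_usd / api_usd_per_million_tokens * 1_000_000
print(f"{break_even_tokens:,.0f} tokens")  # 20,000,000,000 -> 20 billion tokens
```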
It's my layman understanding that this would have to be fixed in the model weights themselves?
It can also be a bug in the model weights because the model is just failing to generate the appropriate "I'm done thinking" indicator.
You can see this described in this PR https://github.com/ggml-org/llama.cpp/pull/19635
Apparently Step 3.5 Flash uses an odd format for its tags, so llama.cpp just doesn't handle it correctly.
It is a bug in the model weights and reproducible in their official chat UI. More details here: https://github.com/ggml-org/llama.cpp/pull/19283#issuecommen...
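Until a fixed release lands, one conceivable harness-side mitigation is to cap the reasoning budget and force-close the block. This is a rough sketch only; the tag string and budget are assumptions (and per the PRs above, the model's actual tag format is non-standard):

```python
END_THINK = "</think>"  # placeholder tag; Step 3.5 Flash reportedly uses its own format
THINK_BUDGET = 8192     # assumed cap on reasoning tokens

def guard_reasoning(token_stream):
    """Yield tokens, force-closing the reasoning block on budget overrun."""
    for i, tok in enumerate(token_stream):
        yield tok
        if tok == END_THINK:
            return  # the model closed its reasoning normally
        if i + 1 >= THINK_BUDGET:
            yield END_THINK  # model appears stuck in a loop; close it ourselves
            return
```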
- OpenAI completions endpoint
- Anthropic messages endpoint
- OpenAI responses endpoint
- A slick-looking web UI
Without having to install anything else.
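This appears to describe llama.cpp's llama-server; assuming that and its default port 8080, a minimal request against the OpenAI-compatible endpoint looks like the sketch below (the model name is a placeholder, since the server serves whatever model it was launched with):

```python
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "step-3.5-flash",  # placeholder; ignored by a single-model server
        "messages": [{"role": "user", "content": "Say hello."}],
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```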
For example, when I tried gpt-oss 120b with Codex, it would very easily forget something present in the system prompt: "use `rg` command to search and list files".
I feel like gpt-oss has a lot of potential for agentic coding, but it needs to be constantly reminded of what is happening. Maybe a custom harness developed specifically for gpt-oss could make both models viable for long agentic coding sessions.
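One way such a harness might work is to re-inject the critical instruction on every turn so it never drifts out of the model's effective attention. A sketch, using OpenAI-style message dicts and the `rg` instruction from above:

```python
REMINDER = "Reminder: use the `rg` command to search and list files."

def build_messages(system_prompt, history):
    # Restate the key rule as the final message of every request.
    return (
        [{"role": "system", "content": system_prompt}]
        + list(history)
        + [{"role": "system", "content": REMINDER}]
    )
```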
[0] https://gist.github.com/lm2s/c4e3260c3ca9052ec200b19af9cfd70...
Not sure if it's directly accessible, but here's the link: https://stepfun.ai/chats/213451451786883072
I don't know anything about TerminalBench, but on the face of it, a 51% score on a test metric doesn't sound like it would guarantee 'unwavering stability' on sophisticated long-horizon tasks.
Looking at https://www.tbench.ai/leaderboard/terminal-bench/2.0, I see that the current best score is 75%, meaning 51% is about ⅔ of SOTA.
I'm reminded of https://swe-rebench.com/ where Opus actually does better without CC. (Roughly same score but half the cost!)
For a benchmark named Terminal-Bench, I would assume it requires some actual terminal "interaction", not just producing the code and commands.
Edit: there are 4-bit quants that can be run on a 128 GB machine like a GB10 [1], AI Max+ 395, or Mac Studio.
[1] https://forums.developer.nvidia.com/t/running-step-3-5-flash...
Does this really mean anything? I, for example, tend to ignore benchmarks focused on agentic tasks because that is not my use case. Instruction following, long-context reasoning, and low hallucination rates carry more weight for me.
IQ4_NL @ 112 GB
Q4_0 @ 113 GB
Which of these would be technically better?
[1] https://huggingface.co/bartowski/stepfun-ai_Step-3.5-Flash-G...
Most "live" benchmarks don't run recent models often enough to give you a good picture of which models win.
The idea of a live benchmark is great! There are thousands of GitHub issues that are resolved with a PR every day.
https://docs.google.com/document/d/1pe_gEbhVDgORtYsQv4Dyml8u...
You can achieve incredible tok/dollar or tok/sec with Qwen3 0.6b.
It just won't be very good for most use cases.
I know what they're trying to do. They're attempting to show a kind of Pareto frontier, but it's a little awkward.
> create a single html file with a voxel car that drives in a circle.
Compared to GLM 4.7 / 5 and Kimi 2.5, it took a while. The output was fast, but it wrote so much that I had to wait longer. Also, the output was more bare-bones compared to the others.
Each time a Chinese model makes the news, I wonder: How come no major models are coming from Japan or Europe?
Europe is a bad place to try and be successful in tech.
Keep a tab on aihubmix, the Chinese OpenRouter, if you want to stay on top of the latest models. They keep track of models from Baichuan, Doubao, BAAI (Beijing Academy of AI), Meituan, 01.AI (Yi), Xiaomi, etc.
Much larger Chinese coverage than OpenRouter.
Before Step 3.5 Flash, I'd been hearing a lot about ACEStep as the only open-weights competitor to Suno.
Do you know of a couple of interesting ones that haven't yet?
Keep your eye on Baidu's Ernie https://ernie.baidu.com/
Artificial Analysis is generally on top of everything:
https://artificialanalysis.ai/leaderboards/models
Those two are really the new players
Nanbeige, which they haven't benchmarked, just put out a shockingly good 3B model: https://huggingface.co/Nanbeige - specifically https://huggingface.co/Nanbeige/Nanbeige4.1-3B
You have to tweak the hyperparameters like they say, but I'm getting quality output, commensurate with maybe a 32B model, in exchange for a huge thinking lag.
It's the new LFM 2.5
ollama create nanbeige-custom -f <(curl https://day50.dev/Nanbeige4.1-params.Modelfile)
That Modelfile has the hyperparameters already set. Then you can try it out. It's taking up about 2.5 GB of RAM.
My test query is always "compare rust and go with code samples". I'm telling you, the thinking token count is... high...
Here's what I got https://day50.dev/rust_v_go.md
I just tried it on a 4 GB Raspberry Pi and a 2012-era X230 with an i5-3210. Worked.
It'll take about 45 minutes on the Pi, which, you know, isn't OOM... so there's that.
Though the only mention I found was in ComfyUI docs: https://docs.comfy.org/tutorials/audio/ace-step/ace-step-v1