"-t 8 matches physical cores. The machine has 16 SMT threads but only 8 cores. On a memory-bound workload, oversubscribing threads adds scheduling cost without adding throughput: the cores are waiting on DDR3, not on each other."

But ... isn't that a classic use case for SMT? Giving T1 something to do while T0 is waiting on DDR(3) and vice versa?

I also don't understand the explanation of "--cpu-moe". If an expert has ~4.0 GiB of parameters, why does optimizing the sequence of experts minimize cache thrashing? With 20 MiB of L3 cache vs 4.0 GiB of parameters, it won't cache any noticeable amount of the parameters, will it?

As mentioned by others, only some Intel Xeon E5-2xxx v4 parts supported DDR3, and according to Intel, the E5-2620 v4 is not one of them.

reply
> But ... isn't that a classic use case for SMT? Giving T1 something to do while T0 is waiting on DDR(3) and vice versa?

Waiting in terms of latency. When the bus is mostly empty and it takes a while to make a round trip it's great to try to find a few extra passengers to put on it. When the buses are all completely full adding the extra riders just makes the bus stop that much more chaotic.
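
The saturation argument can be put into rough numbers with a toy model. This is a minimal sketch, not a measurement: the per-thread streaming rate of 8 GB/s is hypothetical, the 59.7 GB/s peak assumes quad-channel DDR3-1866 (an assumption about the machine), and the 4.25 GB read per token assumes about 4e9 active parameters at Q8_0's 8.5 bits per weight. It ignores the KV cache, speculative decoding, and the gap between peak and achievable bandwidth.

```python
def tokens_per_second(threads, per_thread_gb_s, peak_gb_s, gb_per_token):
    """Toy model: each thread adds bandwidth demand until the memory bus saturates."""
    achieved_gb_s = min(threads * per_thread_gb_s, peak_gb_s)
    return achieved_gb_s / gb_per_token


# All three constants are illustrative assumptions, not values from the article.
PER_THREAD_GB_S = 8.0               # hypothetical streaming rate of one thread
PEAK_GB_S = 59.7                    # assumed quad-channel DDR3-1866 peak
GB_PER_TOKEN = 4e9 * 8.5 / 8 / 1e9  # ~4e9 active weights at 8.5 bits each

if __name__ == "__main__":
    for threads in (2, 4, 8, 16):
        rate = tokens_per_second(threads, PER_THREAD_GB_S, PEAK_GB_S, GB_PER_TOKEN)
        print(f"{threads:2d} threads -> {rate:5.2f} tokens/s")
```

Under these assumptions the rate stops improving once the threads together can saturate the bus, which is the "full bus" case: going from 8 to 16 threads changes nothing in the model, and in practice only adds scheduling overhead.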

reply
This is ironically a pretty solid use case for (ex VLIW research) ILP-optimizing compilers.

Given knowable runtime hardware usage patterns (huge bursts of memory bandwidth saturation) and a single limited core/thread-shared resource (memory bandwidth), one could optimize for the constraint ahead of runtime.

Because most of the performance optimization levers you have available to pull are (a) trade compute for memory bandwidth (e.g. compression), (b) preload when memory bandwidth is available, (c) optimize the choice of what's in cache when, (d) align to cache size / memory boundaries.

Or tl;dr, try to approximate GPU ISAs at the CPU compiler level. (Though why would anyone but hobbyists bother, when everyone else just buys pallets of Nvidia/AMD or designs their own ML chips?)

reply
Fantastic practical achievement!

I wonder if I could get similar or even better performance from a similar Dell T7610 workstation with dual Xeons and also 128GB DDR3?

The CPUs are better core-wise, but that probably does not make much difference?

It has:

CPUs: 2 × Xeon E5-2697 v2

Cores / threads: 24 cores / 48 threads total

Per-CPU cores: 12 cores / 24 threads

Base clock: 2.70 GHz

Max turbo: 3.50 GHz

It is sitting gathering dust, but Gemma at reading speed sounds promising.

reply
Are you sure you've got DDR3? I have two E5 v4 rigs at home and both have DDR4... unless I am wrong and 2011-3 supports both DDR3 and DDR4.
reply
I have a dual E5 v3 that had DDR4 as well. Been going strong for ten years and still overpowered for what I use it for.
reply
I won't speak for cafkafk, but I have two E5 (v3/v4) systems, one on DDR4 and one on DDR3. This generation of CPU all support DDR4, but a few SKUs do support DDR3 as well. ChatGPT told me they were niche products to meet specific customer needs.

I just picked up the DDR3 board, an Aliexpress "XD3", so I could reuse some DDR3 RAM on a better CPU. Quad channel 1866MT/s is not bad!

reply
The first two generations supported DDR3 only. Haswell (v3) and Broadwell (v4) brought DDR4 support.
reply
right, and they talk about "v4" which is DDR4.
reply
This seems remarkably suited to my situation,

    CPU(s): 32
      On-line CPU(s) list: 0-31
    Vendor ID: GenuineIntel  
    Model name: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz

Also with 128G. Do 8 DIMM sockets imply more actual bandwidth in practice?

This poor thing is currently a YouTube watching box.

reply
One thing to note: These Xeons have quad memory channels, that usually means double the bandwidth of an equivalent desktop CPU, if you populate all the slots.

I have a dual E5-2667 v2 server with 512GB DDR3 and it's quite nice, the memory bandwidth is higher than of a DDR4 desktop with a way newer CPU, even though it's ECC and registered.
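
The channel arithmetic behind this is simple enough to sketch. The example below computes theoretical peak bandwidth per socket as channels × 8 bytes × transfer rate. The specific configurations are assumptions for illustration (quad-channel DDR3-1866 for the Xeon, dual-channel DDR4-3200 for a hypothetical desktop); real sustained bandwidth is lower than these peaks.

```python
def peak_bandwidth_gb_s(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak in GB/s: each channel moves bus_bytes per transfer."""
    return channels * bus_bytes * mega_transfers_per_s * 1e6 / 1e9


if __name__ == "__main__":
    # Both configurations are assumed for illustration, not taken from the thread.
    xeon_ddr3 = peak_bandwidth_gb_s(channels=4, mega_transfers_per_s=1866)
    desktop_ddr4 = peak_bandwidth_gb_s(channels=2, mega_transfers_per_s=3200)
    print(f"quad-channel DDR3-1866: {xeon_ddr3:.1f} GB/s per socket")
    print(f"dual-channel DDR4-3200: {desktop_ddr4:.1f} GB/s")
```

On these assumed figures the older quad-channel platform comes out slightly ahead per socket, and that only holds if every channel is populated.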

reply
(purple on black is really hard to read)

You say it runs "at reading speed". Have you benchmarked it?

reply
> (purple on black is really hard to read)

Noted, and agree (it looks like it has also already been clicked, which I dislike). I honestly need to redo the themes.

> You say it runs "at reading speed". Have you benchmarked it?

At some point a few weeks ago, yes I think so, but I didn't write it down for some reason... so I'll have to find a time when it's not busy and do it again without a noisy system. Right now the system is noisy, but that said doing it like this:

    llama-cli --model gemma-4-26B-A4B-it-Q8_0.gguf \
      --model-draft gemma-4-26B-A4B-t-assistant-GGUF/wikitext-2-raw_ik-llama-mtp_drafter-conservative/gemma-4-26B-A4B-it-assistant-Q8_0.gguf \
      --spec-type mtp --draft-max 3 --draft-p-min 0.0 --color \
      -sm graph -smgs -sas -mea 256 --split-mode-f32 --temp 0.7 \
      --cpu-moe -t 8 --flash-attn on --mla-use 3 --merge-up-gate-experts \
      --special --mlock --run-time-repack --spec-autotune --no-kv-offload \
      --parallel 8 --jinja -p "Why is the sky blue?" -n 128

Gives:

  llama_print_timings:        load time =   83911.65 ms
  llama_print_timings:      sample time =      26.99 ms /   128 runs   (    0.21 ms per token,  4742.15 tokens per second)
  llama_print_timings: prompt eval time =     343.41 ms /     7 tokens (   49.06 ms per token,    20.38 tokens per second)
  llama_print_timings:        eval time =   10639.36 ms /   127 runs   (   83.77 ms per token,    11.94 tokens per second)
  llama_print_timings:       total time =   11114.98 ms /   134 tokens

So 11.94 tokens per second while it's also playing binary cache and CI builder.
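
For anyone cross-checking the figures, the rates follow directly from the elapsed time and token count on each line. A small sketch that recomputes them; the helper name is made up for this example, and the numbers are copied from the output above:

```python
def tokens_per_second(token_count, elapsed_ms):
    """Convert a (count, milliseconds) pair from the timing lines into tokens/s."""
    return token_count / (elapsed_ms / 1000.0)


if __name__ == "__main__":
    # Values copied from the llama_print_timings output above.
    prompt_rate = tokens_per_second(7, 343.41)         # prompt eval (input tokens)
    generation_rate = tokens_per_second(127, 10639.36)  # eval (generated tokens)
    print(f"prompt eval: {prompt_rate:.2f} tokens/s")
    print(f"eval:        {generation_rate:.2f} tokens/s")
```

The "prompt eval" line covers processing the 7 input tokens, while the "eval" line covers generation, which is where the 11.94 figure comes from.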

When I do it properly, I'll add it to the blog as well!

reply
And if you ever run out of things to do in your copious free time, it looks like that PR #1744 was merged without the has_target_ctx assert two days after you uploaded your drafter quants. So you can now redo all your quants and rerun all your benchmarks ;-).
reply
> two days after you uploaded your drafter quants. So you can now redo all your quants and rerun all your benchmarks ;-)

2010s JavaScript, putting down the controller: Ha, no one will ever surpass my high score for wasting programmer time with dependency churn...

2026 Open Source ML: Hold my beer.

reply
What's time to first token? Raw throughput is usually not the problem in local setups in my experience.
reply
I am pretty sure llama.cpp has its own benchmarking binary that you can use.
reply
llama-bench is part of the llama-cpp package, but from recent experimentation, the settings it is able to (or is documented to?) accept lag behind somewhat. Not sure whether it would accept all of the esoteric settings in the article?
reply
20 tokens per second for eval time is the killer here. It means you can't use this to process any meaningful amount of text.

A GPU typically processes close to 1000 tokens/s during eval.

reply
The prompt is literally "why is the sky blue?" and consists of 7 tokens.

It's probably too small for the timings to be taken seriously.

reply
I'm pretty sure eval time is token generation time where it's actually outputting new tokens. If you're getting a thousand per second on that, I'd love to know on what.
reply
From the prompt timings above, it seems like 'prompt eval time' is the equivalent to 'processing time for input tokens'.

Hyperscalers can perform this evaluation very quickly because evaluation can be significantly parallelized. The layer `i` output of token `j` only requires access to the layer `i-1` output of all previous tokens, so a parallel frontier develops. Token (0,0) [(token, layer)] is processed first, then tokens (0,1) and (1,0) can be processed in parallel, then (0,2), (1,1), and (2,0), and so on.

The maximum parallel width becomes equal to the number of layers in the model (once the prompt is at least that many tokens long). The Gemma 4 26B-A4B model discussed in this article evidently has 30 layers, giving a 30-fold speedup if the system were otherwise unconstrained (all layers can be run in parallel, and one full set of layer outputs is completed in the KV pass for each pass of the parallel sweep).

In the specific output above, however, the input prompt is only seven tokens long so there are probably considerable non-amortized spinup effects at play.
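
The frontier described above can be enumerated directly. This is a sketch of the schedule as stated in this comment, using (token, layer) pairs grouped by anti-diagonal. It is not a description of how llama.cpp actually batches prompt processing, and the function names are made up for the example.

```python
def wavefront(num_tokens, num_layers):
    """Group (token, layer) cells into steps; cells in one step can run in parallel."""
    steps = []
    for diagonal in range(num_tokens + num_layers - 1):
        cells = [
            (token, diagonal - token)
            for token in range(num_tokens)
            if 0 <= diagonal - token < num_layers
        ]
        steps.append(cells)
    return steps


def ideal_speedup(num_tokens, num_layers):
    """Sequential cell count divided by the number of parallel steps."""
    return (num_tokens * num_layers) / (num_tokens + num_layers - 1)


if __name__ == "__main__":
    steps = wavefront(num_tokens=7, num_layers=30)
    print("first steps:", steps[:3])
    print("widest step:", max(len(cells) for cells in steps))
    print(f"7 tokens:    {ideal_speedup(7, 30):.2f}x")
    print(f"4096 tokens: {ideal_speedup(4096, 30):.2f}x")
```

With only seven tokens the widest step holds seven cells, so the ideal speedup stays far below the layer count; it approaches 30 only as the prompt grows much longer than the model is deep.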

reply
Seven tokens long input isn't very realistic, is it? For coding tasks it's normal for the input to be thousands or 10s of thousands. If it wasn't for prefix caching it'd be one miserable experience, but even then at the very best the input is often in hundreds each time. And don't even try to dump some logs into the prompt.
reply
> Seven tokens long input isn't very realistic, is it?

The test prompt above was "Why is the sky blue?", so there's the seven tokens. I meant to highlight that because I'd expect processing of a thousand-token input to be faster per token than presented.

reply
I meant prompt eval time.
reply
Something doesn't add up here. As someone who has only recently built a home-server from an E5-26xx v2 on DDR3 RAM (because I have a sh*tload of 32g DDR3 DIMMs), I can confidently say that the newer cores (E5-26xx v3 and v4) only run on DDR4 memory...

So either you have a v2 instead of a v4 (and run on DDR3 memory), or you have a v4 but with DDR4 memory (not DDR3)

Everything else doesn't work

reply
There are some OEM-only v3/v4 parts with dual memory controllers (because of a RAM supply crunch at the time, funnily enough), but the E5-2620 v4 is not one of them. The classic example is the very popular 12-core E5-2678 v3.
reply
This is not true. A few well known brands made both DDR3 and DDR4 servers that support v3 & v4 chips. Ask me how I know :-)
reply
enlighten us
reply
It looks like Supermicro had some DDR3 Xeon v3/v4 boards, and the first thing that came to mind was a Shenzhen workstation/gaming board using recycled parts... haven't searched on that but it's bound to exist.
reply
Yeah, the Intel reference page only lists DDR4, not DDR3:

https://www.intel.com/content/www/us/en/products/sku/92986/i...

reply
> So either you have a v2 instead of a v4 (and run on DDR3 memory), or you have a v4 but with DDR4 memory (not DDR3)

Yup that's odd... I've got a Xeon 2680 v4 (14 cores) (amazing bargain of a little beast btw) and it's indeed on DDR4 and I saw all Xeons v4 as supporting DDR4 only.

Full spec (brand/model/mobo type) would have been nice: mine's an HP Z440 workstation repurposed as a server (which I only turn on when I'm working and which I religiously turn off before going to bed).

reply
How many watts is that setup? Cool you got it to work, but maybe only useful for vintage / retro computing rather than practical if the energy consumption makes it economically wasteful.
reply
IDK about OP's setup, but I run a pile of E5-2683v4 Xeon recycled servers for Ceph and self-hosted business SaaS usage.

One node's ipmitool sensor report (from a self-monitoring PSU, so grain of salt, but my UPS-side monitoring tracks closely) shows 250-300 W average power use. Mind you, that is for running 22 spinning disks, 2 SAS/SATA SSDs, 4 NVMe SSDs, and 768GB of DDR4.

Mid-gen 2015ish Xeons were not great at power reduction, but if you are pegging the cores, they were never particularly slow, and they did have lots of PCIe lanes. This boils down to the CPU/mobo itself not being that big a cost floor, especially if you have high utilization rates.

As a comparison, my main desktop development machine, running a Threadripper 9970X, 128GB of DDR5, an RDNA4 GPU, and a small pile of NVMe drives, has a power floor of roughly 250W. On some CPU-centric workloads you'll definitely lose out on the older gens of machines, but they are by no means impractical.

Maybe for a desktop use case they are absolutely suboptimal nowadays, but for a lot of real-world use cases I would say they're still relevant.

---

Like the author posts for the LLM usecase, I think optimizing the hardware choice to the application and not leaving levers unpulled is a big key, especially considering how wide a variety of bandwidth/power draw/peak frequency/corecount SKUs exist in the Xeon lines. Without knowing what you intend to run and fitting the correct processor to it, you will end up with a disappointingly poor environment fit.

reply
How many kWh to fabricate a brand new machine better suited to the task?

As long as performance is useable (apply your own metrics!), pulling it from existing hardware is likely the option with the lower eco footprint.

Also: chances are it'll only be used for this purpose occasionally, and/or for a short while. In that scenario [fabricating new hardware] always has the bigger eco footprint.

reply
I don’t know why you’d assume that an older system is lower footprint.

If you’ve got something consuming 100 watts average over your 24 hour period, and your electricity costs 20 cents per kWh, you’re already spending almost as much as a Claude subscription.

Just on electricity, this assumes your hardware never fails and you never incur any additional costs.
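
The arithmetic behind that figure, as a minimal sketch. The 100 W draw and 20 cents per kWh come from the comment above; the 30-day month is an assumption, and no subscription price is assumed here.

```python
def monthly_electricity_cost(average_watts, price_per_kwh, days=30):
    """Cost of running a constant average load around the clock for `days` days."""
    kwh = average_watts / 1000.0 * 24 * days
    return kwh * price_per_kwh


if __name__ == "__main__":
    # 100 W average at $0.20/kWh, the figures used in the comment above.
    print(f"100 W: ${monthly_electricity_cost(100, 0.20):.2f} per month")
    # 250-300 W, the range reported for one of the recycled Xeon nodes earlier in the thread.
    print(f"250 W: ${monthly_electricity_cost(250, 0.20):.2f} per month")
    print(f"300 W: ${monthly_electricity_cost(300, 0.20):.2f} per month")
```

The cost scales linearly with average draw, so the result depends heavily on how often the machine is actually powered on and loaded.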

There’s a big reason why newer more efficient hardware is in demand. Something that’s 10+ years old has drastically worse performance per watt.

Obviously I am not saying to throw away your old hardware as a rule but there is a point where some of this old stuff just isn’t even worth running.

reply
The reason more performance/watt is in demand is that a datacenter can't suddenly draw twice as much power.
reply
I have two LARGE Xeon systems of this era that I used when I was heavily involved with Kubernetes and needed to build out a home lab. One is 2x Xeon w/ 256 GB of RAM, and one is 1x Xeon w/ 512GB of RAM. Both are slow as dogs, and both of them draw 150+ watts with only one power supply. My 12th gen Intel NUC is so, so much faster and more efficient. I'm recycling the Xeon systems.
reply
Xeon is a group of products with really varying specs. There is no indication of which Xeons. Also, new consumer CPUs often have really small internal caches.
reply
E5-2690s in my case.
reply
You mention lower footprint but then make a cost comparison against Claude subscription pricing.

Claude subscription pricing is a broken way to consider footprint.

reply
You can call it whatever you want, money is money, and money spent on energy is footprint.
reply
Would you consider improving the website's layout? Right now I find it below average quality and very distracting. Whether you are an engineer or not is not really important; great engineers can write horrible text or use a layout that is not ideal, for instance.