But ... isnt that a classic use case for SMT? Giving T1 sth. to do while T0 is waiting on DDR(3) and vise-versa?
I also dont understand the explanation of "--cpu-moe". If an expert has ~ 4.0 GiB of Parameters, why does optimizing the sequence of experts minimize cash trashing? With 20 MiB of L3 Cash vs 4.0 GiB of Parameters, it wont cash any noticeable amount of the Parameters, will it?
As mentioned by others, only some Intel Xeon E5-2xxx v4 did support DDR3, and according to Intel, the E5-2620 v4 is not one of them.
Waiting in terms of latency. When the bus is mostly empty and it takes a while to make a round trip it's great to try to find a few extra passengers to put on it. When the buses are all completely full adding the extra riders just makes the bus stop that much more chaotic.
Given knowable runtime hardware usage patterns (huge bursts of memory bandwidth saturation) and a single limited core/thread-shared resource (memory bandwidth), one could optimize for the constraint ahead of runtime.
Because most of the performance optimization levers you have available to pull are (a) trade compute for memory bandwidth (e.g. compression), (b) preload when memory bandwidth is available, (c) optimize the choice of what's in cache when, (d) align to cache size / memory boundaries.
Or tl;dr, try to approximate GPU ISAs at the CPU compiler level. (Which why would anyone but hobbyists, because everyone else just buys pallets of Nvidia/AMD or designs their own ML chips?)
I wonder if I could get similar or even better performance from similar Dell T7610 workstation with dual Xeons and also 128GB DDR3?
The CPUs are better core wise, but that probably does not make much difference?
It has CPUs 2 × Xeon E5-2697 v2
Cores / threads 24 cores / 48 threads total
Per-CPU cores 12 cores / 24 threads
Base clock 2.70 GHz
Max turbo 3.50 GHz
It is sitting gather dust but reading spead Gemma sounds promising.
I just picked up the DDR3 board, an Aliexpress "XD3" so I could reuse some DDR3 ram on a better CPU. Quad channel 1866MT/s is not bad!
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz
Also with 128G. Does 8 dimm sockets imply more actual bandwidth in practice?This poor thing is currently a YouTube watching box.
I have a dual E5-2667 v2 server with 512GB DDR3 and it's quite nice, the memory bandwidth is higher than of a DDR4 desktop with a way newer CPU, even though it's ECC and registered.
You say it runs "at reading speed". Have you benchmarked it?
Noted, and agree (it looks like it has also already been clicked, which I dislike). I honestly I need to redo the themes.
> You say it runs "at reading speed". Have you benchmarked it?
At some point a few weeks ago, yes I think so, but I didn't write it down for some reason... so I'll have to find a time when it's not busy and do it again without a noisy system. Right now the system is noisy, but that said doing it like this:
llama-cli --model gemma-4-26B-A4B-it-Q8_0.gguf --model-draft gemma-4-26B-A4B-t-assistant-GGUF/wikitext-2-raw_ik-llama-mtp_drafter-conservative/gemma-4-26B-A4B-it-assistant-Q8_0.gguf --spec-type mtp --draft-max 3 --draft-p-min 0.0 --color -sm graph -smgs -sas -mea 256 --split-mode-f32 --temp 0.7 --cpu-moe -t 8 --flash-attn on --mla-use 3 --merge-up-gate-experts --special --mlock --run-time-repack --spec-autotune --no-kv-offload --parallel 8 --jinja -p "Why is the sky blue?" -n 128
Gives:
llama_print_timings: load time = 83911.65 ms
llama_print_timings: sample time = 26.99 ms / 128 runs ( 0.21 ms per token, 4742.15 tokens per second)
llama_print_timings: prompt eval time = 343.41 ms / 7 tokens ( 49.06 ms per token, 20.38 tokens per second)
llama_print_timings: eval time = 10639.36 ms / 127 runs ( 83.77 ms per token, 11.94 tokens per second)
llama_print_timings: total time = 11114.98 ms / 134 tokens
So 11.94 tokens per second while it's also playing binary cache and CI builder.When I do it properly, I'll add it to the blog as well!
2010s Javascript, putting down the controller: Ha, no one will ever surpass my high score for wasting programmer time with dependency churn...
2026 Open Source ML: Hold my beer.
A GPU typically processes close to 1000 tokens/s during eval.
It's probably too small for the timings to be taken seriously.
Hyperscalers can perform this evaluation very quickly because evaluation can be significantly parallelized. The layer `i` output of token `j` only requires access to the layer `i-1` output of all previous tokens, so a parallel frontier develops. Token (0,0) [(token, layer)] is processed first, then tokens (0,1) and (1,0) can be processed in parallel, then (0,2), (1,1), and (2,0), and so on.
The maximum parallel width becomes equal to the number of layers in the model. Gemma 4 26B-A4B model discussed in this article evidently has 30 layers, giving a 30-fold speedup if the system were otherwise unconstrained (all layers can be run in parallel, and one full set of layer outputs is completed in the KV pass for each pass of the parallel sweep).
In the specific output above, however, the input prompt is only seven tokens long so there are probably considerable non-amortized spinup effects at play.
The test prompt above was "Why is the sky blue?", so there's the seven tokens. I meant to highlight that because I'd expect processing of a thousand-token input to be faster per token than presented.
So either you have a v2 instead of a v4 (and run on DDR3 memory), or you have a v4 but with DDR4 memory (not DDR3)
Everything else doesn't work
https://www.intel.com/content/www/us/en/products/sku/92986/i...
Yup that's odd... I've got a Xeon 2680 v4 (14 cores) (amazing bargain of a little beast btw) and it's indeed on DDR4 and I saw all Xeons v4 as supporting DDR4 only.
Full spec (brand/model/mobo type) would have been nice: mine's an HP Z440 workstation repurposed as a server (which I only turn on when I'm working and which I religiously turn off before going to bed).
One node's ipmitool sensor report (and self-monitoring PSU, so grain of salt, but my UPS side monitoring tracks closely), reports 250-300w average power use. This though, mind you is for running 22 spinning disks, 2 SAS/SATA SSDs, and 4 NVME ssds, and 768GB of DDR4.
Mid-gen 2015ish Xeons were not great at power reduction, but if you are pegging the cores, they were never particularly slow, and they did have lots of PCIe lanes. This boils down to the CPU/mobo itself not being that big a cost floor, especially if you have high utilization rates.
As a comparison, my main desktop development machine, running a Threadripper 9970X, 128GB of DDR5, a RDNA4 GPU, and a small pile of NVME drives has a power floor of roughly 250W. Some CPU centric workloads you'll definitely lose out on on the older gens of machines, but they are by no means impractical.
Maybe for a desktop usecase they are absolutely suboptimal nowadays, but for a lot of realworld usecases I would say they're still relevant.
---
Like the author posts for the LLM usecase, I think optimizing the hardware choice to the application and not leaving levers unpulled is a big key, especially considering how wide a variety of bandwidth/power draw/peak frequency/corecount SKUs exist in the Xeon lines. Without knowing what you intend to run and fitting the correct processor to it, you will end up with a disappointingly poor environment fit.
As long as performance is useable (apply your own metrics!), pulling it from existing hardware is likely the option with the lower eco footprint.
Also: chances are it'll only be used for this purpose occasionally, and/or for a short while. In that scenario [fabricating new hardware] always has the bigger eco footprint.
If you’ve got something consuming 100 watts average over your 24 hour period, and your electricity costs 20 cents per kWh, you’re already spending almost as much as a Claude subscription.
Just on electricity, this assumes your hardware never fails and you never incur any additional costs.
There’s a big reason why newer more efficient hardware is in demand. Something that’s 10+ years old has drastically worse performance per watt.
Obviously I am not saying to throw away your old hardware as a rule but there is a point where some of this old stuff just isn’t even worth running.
Claude subscription pricing is a broken way to consider footprint.