by aliljet 9 hours ago |

by mordae 5 hours ago

Look at GB/s.

Strix Halo has 256 GB/s of memory bandwidth for $2500. The Flash model reads about 13 GB of active weights per token.

256 / 13 ≈ 19.7 tokens per second

Except it does not fit in the 128 GB maximum RAM that Strix Halo supports. So move on.

Another option is Threadripper. That's 8 memory channels. Using older DDR4-3200 you get roughly 200 GB/s. For $2000.

200 / 13 ≈ 15.4 tokens per second

But a chunk of the per-token weights is always the same (shared, not routed MoE experts), so you would offload that part to a GPU and get a decent speedup. Say 25 tokens per second total.

Then likely some expensive Mac. No idea.

Eventually you arrive at a mining rig chassis with a beefy board and multiple GPUs. That has the benefit of pipelining: one GPU runs its part of the model and hands off, so another batch can start on the first one. Low (say 30-100) tps per request, but a lot more in parallel. Best bought and shared with other people.
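
The bandwidth arithmetic in this comment can be sketched in a few lines of Python. This is only a rough upper bound, assuming decode speed is limited purely by streaming the active weights from memory once per token; the function names are made up for illustration, and the 13 GB figure is simply the one quoted above. The DDR4 number assumes a standard 64-bit (8-byte) channel.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s.

    Assumes each channel moves `bus_bytes` bytes per transfer
    (8 bytes for a standard 64-bit DDR channel).
    """
    return channels * mega_transfers * bus_bytes / 1000


def max_tokens_per_second(bandwidth_gbs: float, active_gb_per_token: float) -> float:
    """Upper bound on decode speed if every token streams the active weights once."""
    return bandwidth_gbs / active_gb_per_token


ACTIVE_GB = 13.0  # per-token figure quoted in the comment, not a measured value

ddr4_8ch = peak_bandwidth_gbs(channels=8, mega_transfers=3200)
print(f"8-channel DDR4-3200 peak: {ddr4_8ch:.1f} GB/s")
print(f"Strix Halo bound:   {max_tokens_per_second(256, ACTIVE_GB):.1f} tok/s")
print(f"Threadripper bound: {max_tokens_per_second(200, ACTIVE_GB):.1f} tok/s")
```

Real throughput will land below these bounds, since sustained bandwidth is lower than the theoretical peak and compute, KV cache reads, and scheduling all cost time.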

by revolvingthrow 8 hours ago

For Flash? 4-bit quant, 2x 96 GB GPU (fast and expensive) or 1x 96 GB GPU + 128 GB RAM (still expensive but probably usable, if you’re patient).

A Mac with 256 GB memory would run it but be very slow, and so would a 256 GB RAM + cheapo GPU desktop, unless you leave it running overnight.

The big model? Forget it, not this decade. You can theoretically load from SSD but waiting for the reply will be a religious experience.

Realistically the biggest models you can run on local-as-in-worth-buying-as-a-person hardware are between 120B and 200B, depending on how far you’re willing to go on quantization. Even this is fairly expensive, and that’s before RAM went to the moon.

by zargon 8 hours ago

Flash is less than 160 GB. No need to quantize to fit in 2x 96 GB. Not sure how much context fits in 30 GB, but it should be a good amount.

by redrove 8 hours ago

It seems to be 160 GB at mixed FP4+FP8 precision, FYI. Full FP8 is 250 GB+. (B)F16 would be around double that, I would assume.

by zargon 8 hours ago

There is no BF16. There is no FP8 for the instruct model. The instruct model at full precision is 160 GB (mixed FP4 and FP8). The base model at full precision is 284 GB (FP8). Almost everyone is going to use instruct. But I do love to see base models released.
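
The sizes in this subthread line up with simple parameter-count arithmetic. Here is a rough sketch, assuming the 284B total parameter count mentioned elsewhere in the thread and counting weights only (no KV cache, activations, or runtime overhead); the helper names are hypothetical.

```python
def weights_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate weight size in GB: parameters times bits, divided by 8 bits per byte."""
    return params_billion * bits_per_param / 8


def headroom_gb(model_gb: float, devices_gb: list[float]) -> float:
    """Memory left over after loading the weights; negative means it does not fit."""
    return sum(devices_gb) - model_gb


TOTAL_PARAMS_B = 284  # total parameter count quoted in the thread (284B A13B)

print(f"All FP8:  {weights_gb(TOTAL_PARAMS_B, 8):.0f} GB")  # matches the 284 GB base model
print(f"All FP4:  {weights_gb(TOTAL_PARAMS_B, 4):.0f} GB")  # floor for a pure 4-bit layout

# The 160 GB mixed FP4+FP8 instruct release sits between those two figures.
print(f"Instruct on 2x 96 GB: {headroom_gb(160, [96, 96]):.0f} GB spare")
print(f"Base on 2x 96 GB:     {headroom_gb(284, [96, 96]):.0f} GB spare")
```

The spare memory on the instruct line is what is left for context, which is roughly the "30 GB" estimated above.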

by awakeasleep 9 hours ago

The same way you fit a bucket wheel excavator in your garage

by floam 8 hours ago

Very carefully

by zozbot234 8 hours ago

Run on an old HEDT platform with a lot of parallel attached storage (probably PCIe 4) and fetch weights from SSD. You'd ultimately be limited by the latency of these per-layer fetches, since MoE weights are small. You could reduce the latencies further by buying cheap Optane memory on the second-hand market.

by datadrivenangel 8 hours ago

A loaded MacBook Pro can get you to the frontier from 24 months ago at ~10-40 tok/s, which is plenty fast for regular chatting.

by 542458 9 hours ago

The low end could be something like an eBay-sourced server with a truckload of DDR3 RAM doing all-CPU inference. Secondhand server models with a terabyte of RAM can be had for about $1.5K. The TPS will be absolute garbage and it will sound like a jet engine, but it will nominally run.

The Flash version here is 284B A13B, so it might perform OK with a fairly small amount of VRAM for the active params and regular RAM for all the other params, but I’d have to see benchmarks. If it turns out that works alright, an eBay server plus a 3090 might be the bang-for-buck champ for about $2.5K (assuming you’re starting from zero).

by jdoe1337halo 9 hours ago

More like 500k