by cptskippy 14 hours ago

by hbbio 14 hours ago

Interesting setup, thx for sharing.

How many tokens/sec do you get with 27b? Are you using MTP?

by askvictor 13 hours ago

Does Intel make decent GPUs now? I must be out of the loop...

by speedgoose 13 hours ago

They released a few good value GPUs for LLM inference about a year ago: more memory than AMD and NVIDIA consumer GPUs, not too expensive, but also not great tokens/watt.

I am not sure whether you can find those in stock anywhere.

by cptskippy 3 hours ago

I'm using an Intel Arc Pro B70, which has 32 GB of VRAM. It's estimated to get ~35-45 t/s, which works out to roughly $21-27 per t/s. An RTX 5090 gets ~61 t/s at ~$33 per t/s.

So in terms of raw power Nvidia is still easily king, but in price-to-capacity Intel is best in class.
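
For anyone who wants to redo the math with their own numbers: the $/t/s figure is just card price divided by throughput. This is a minimal sketch, not a benchmark. The prices in it are assumptions back-calculated from the figures quoted above (for example, 35 t/s at $27 per t/s implies about $945); they are not quoted in this thread, and `cost_per_tps` is a made-up helper name.

```python
def cost_per_tps(price_usd: float, tokens_per_sec: float) -> float:
    """Dollars paid per token/second of throughput (lower is better)."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    return price_usd / tokens_per_sec


# Assumed prices, implied by the quoted $/t/s figures rather than stated anywhere.
B70_PRICE_USD = 945.0
RTX_5090_PRICE_USD = 2013.0

# Throughput values are the estimates quoted in the comment above.
cards = [
    ("Arc Pro B70 (low estimate)", B70_PRICE_USD, 35),
    ("Arc Pro B70 (high estimate)", B70_PRICE_USD, 45),
    ("RTX 5090", RTX_5090_PRICE_USD, 61),
]

for name, price, tps in cards:
    print(f"{name}: ${cost_per_tps(price, tps):.0f} per t/s")
```

Swap in the street price and the t/s you actually measure; both move around a lot depending on the model and quantization.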

Intel's Battlemage GPUs also natively support SR-IOV and GPU partitioning which allows you to isolate workloads. This is useful in homelab environments if you have workloads that benefit from GPU acceleration. I was able to split the B70 into 4 virtual GPUs and hand them out to Frigate NVR, Plex, and other workloads.

by jauntywundrkind 14 hours ago

What's the value of running the smaller model too? Why not just use the big model for everything? I note both are dense, as well.

by Ritewut 14 hours ago

Tokens per second. In practical usage, the difference between 8B and something like 16B is not as big as you might think, and 8B is a lot faster and more interactive. There are still certain things where it is useful to farm the work out to the large model.

by Natalia724 13 hours ago

Agree. For local coding help, latency often matters more than raw benchmark quality. A slightly weaker model that answers immediately changes how often you reach for it.

by cptskippy 2 hours ago

Exactly this.

Creating conversation titles and parsing HTML/JSON don't benefit from 27B models.

The B70 can run both models comfortably side-by-side so it makes better use of time and resources.
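
The split described above can be sketched as a tiny router: light chores go to the small model, everything else goes to the large one. This is only an illustration of the idea, not the actual setup. The model names, the task labels, and `pick_model` are all hypothetical.

```python
# Hypothetical model identifiers; substitute whatever your server exposes.
SMALL_MODEL = "small-model"
LARGE_MODEL = "large-27b"

# Assumed task labels for chores that don't benefit from a 27B model.
LIGHT_TASKS = {"conversation_title", "parse_html", "parse_json"}


def pick_model(task: str) -> str:
    """Return the small model for light chores, the large model otherwise."""
    return SMALL_MODEL if task in LIGHT_TASKS else LARGE_MODEL


print(pick_model("conversation_title"))
print(pick_model("code_review"))
```

Defaulting unknown tasks to the large model is a deliberate choice here: a slow but good answer is usually a safer fallback than a fast but weak one.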