> But how will a GPU with small-ish but fast VRAM and great compute, augment a Mac with large but slow VRAM and weak compute?

It would work just like a discrete GPU in ordinary CPU+GPU inference: you'd run some of the layers on the discrete GPU and place the rest in unified memory. You'd want to minimize CPU/GPU transfers even more than usual, since a Thunderbolt connection only gives you throughput equivalent to PCIe 4.0 x4.
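As a rough sketch of why a clean layer split keeps the link traffic small: with one split point, only the boundary activations cross the Thunderbolt link per token, not the weights. The hidden size and link rate below are assumptions for illustration, not measurements.

```python
# Back-of-envelope: per-token data crossing the Thunderbolt link when the
# model is split between a discrete GPU and unified memory.
GBPS = 1e9 / 8  # bytes per second per Gbit/s

tb_pcie_tunnel = 64 * GBPS   # assumed: ~PCIe 4.0 x4 of tunneled PCIe bandwidth
hidden_size = 8192           # assumed hidden dimension of a large model
bytes_per_act = 2            # fp16 activations

# With a single split point, only the activation vector at the boundary
# crosses the link, once in each direction, per generated token.
per_token_bytes = hidden_size * bytes_per_act * 2
transfer_us = per_token_bytes / tb_pcie_tunnel * 1e6
print(f"{per_token_bytes} B/token of boundary traffic -> {transfer_us:.1f} us on the link")
```

Microseconds per token of link time is negligible next to the compute; the problem is only when weights themselves have to move.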

reply
But isn’t the Mac Mini the weak link in that scenario?
reply
It has way more unified memory than your typical dGPU.
reply
Yes, obviously. That VRAM is also slower, and the compute attached to it is weak. Transfers to the external GPU will slow things down too much.
reply
My Mini is actually the smallest model, so it has "small but slow VRAM" (haha!); the reason I want the GPU is for the smaller Gemmas or Qwens. Realistically, I'll probably run on an RTX 6000 Pro, but this might be fun for home.
reply
We've seen many recent projects that stream models directly from SSD into a discrete GPU's limited VRAM on PCs.

How big a bottleneck is Thunderbolt 5 compared to an SSD? Is the 120 Gbps mode only available when linked to a monitor?

reply
That's what, 14 GB/s? The GPU's VRAM can do 100x that.
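Putting rough numbers on that comparison (vendor headline figures, assumed rather than benchmarked):

```python
# Rough bandwidth comparison: Thunderbolt 5 link vs. a dGPU's local VRAM.
GBPS = 1e9 / 8               # bytes per second per Gbit/s

tb5_boost = 120 * GBPS       # Thunderbolt 5 asymmetric "Bandwidth Boost" mode
tb5_symmetric = 80 * GBPS    # Thunderbolt 5 standard symmetric mode
vram_bw = 1000e9             # assumed ~1 TB/s for a high-end dGPU's VRAM

print(f"TB5 boost:     {tb5_boost / 1e9:.0f} GB/s")
print(f"TB5 symmetric: {tb5_symmetric / 1e9:.0f} GB/s")
print(f"VRAM is ~{vram_bw / tb5_boost:.0f}x the boosted link")
```

So even the 120 Gbps mode is around 15 GB/s, one to two orders of magnitude below on-card VRAM bandwidth.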
reply
A discrete consumer GPU card doesn't have enough fast RAM to run a very large model that hasn't been quantized to hell.

That's why all the projects streaming models into the GPU from an SSD popped up recently.

reply
Yes. There’s just no way to get above 1t/s that way with a large model.
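A quick ceiling on that: dense decoding reads every weight once per token, so if the weights have to stream over the link each token, the link bandwidth bounds throughput. Model size and link rate below are illustrative assumptions.

```python
# Upper bound on tokens/s when weights stream over the link every token.
GBPS = 1e9 / 8               # bytes per second per Gbit/s

link = 120 * GBPS            # best case: Thunderbolt 5 boost mode, ~15 GB/s
model_bytes = 70e9           # assumed: e.g. a 70B-parameter model at 8-bit
tokens_per_s = link / model_bytes
print(f"throughput ceiling: {tokens_per_s:.2f} t/s")
```

Even in the best case this lands well under 1 t/s; only sparse architectures (MoE) or aggressive caching change the picture.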
reply