Does an SSD meaningfully degrade by read only workloads?
reply
Nope, reads don’t cause wear
reply
No appreciable wear of course, but read disturb (requiring occasional rewrites) becomes more of an issue as NAND fabrication advances.
reply
> If I were to run this on a Mac SSD, 24/7 for heavy usage such as Openclaw, that is going to significantly reduce the lifetime of the SSD.

How sure are you about that? I've never looked closely at how a large mixture-of-experts LLM switches between expert modules, but when staying on roughly the same topic (as you often would when editing the same codebase), I wouldn't be surprised if the changes in expert composition were fairly rare and fairly small, and to the extent switching does happen, it causes repeated reads from the flash disk rather than writes.

reply
Afaik the experts are not usually very interpretable, and I'd generally be surprised if at least one didn't change every token. I don't know what happens in practice, but I do know that at least during training, nothing is done to minimize the number of expert switches between tokens.
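To make that concrete, here's a minimal sketch of per-token top-k routing (made-up shapes and names, not any particular model's implementation): each token's router logits independently pick that token's experts, so the active set can change on every single token.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_model, top_k = 8, 16, 2
router_w = rng.normal(size=(d_model, n_experts))  # router projection (toy)

def route(token_vec):
    """Return the set of top-k expert indices for one token."""
    logits = token_vec @ router_w
    return set(np.argsort(logits)[-top_k:])

tokens = rng.normal(size=(5, d_model))  # 5 consecutive token activations
picks = [route(t) for t in tokens]
# Count how often the active expert set changes between adjacent tokens:
changes = sum(a != b for a, b in zip(picks, picks[1:]))
print(picks, changes)
```

Nothing in the router ties token t's choice to token t-1's, which is the point: without an explicit incentive, adjacent tokens routinely land on different experts.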
reply
I'd have expected at least a tiny explicit penalty term for switching, to discourage churning the expert composition without any expected gain from it.

If one is to use these on hardware that can't keep everything loaded, I guess someone should examine how it works out in practice. Interpretability may be too much to ask, but I can't spontaneously see any reason why the experts can't at least be pushed to incorporate what's needed to remain the good choice for a longer segment.
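A penalty like that could look something like this (a hypothetical sketch, not from any real training recipe): penalize the change in routing probabilities between consecutive tokens, nudging the router toward stable expert choices within a segment.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def switch_penalty(router_logits, weight=0.01):
    """router_logits: (seq_len, n_experts).
    L1 distance between adjacent tokens' routing distributions,
    scaled by a small weight and added to the training loss."""
    p = softmax(router_logits)
    return weight * np.abs(p[1:] - p[:-1]).sum(axis=-1).mean()

logits = np.random.default_rng(3).normal(size=(6, 8))
loss_term = switch_penalty(logits)
```

Whether this hurts quality (by forcing experts away from their best per-token choice) is exactly the trade-off the comment above is wondering about.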

reply
The switching is done per layer, not just per token. Every layer loads completely different parameters, so you don't really benefit from continuity. You're generally better off shifting this work to the CPU, since CPU RAM is more abundant than GPU VRAM, hence it matters less that so much of it is "wasted" on inactive expert layers. Disk storage is more abundant still, so offloading experts to disk when they don't fit in RAM (as OP does) is the next step.
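The per-layer point can be sketched quickly (toy shapes, invented names): each MoE layer runs its own router, so even a single token's forward pass needs a different set of expert weights resident at every layer.

```python
import numpy as np

rng = np.random.default_rng(1)
n_layers, n_experts, d_model, top_k = 4, 8, 16, 2
routers = rng.normal(size=(n_layers, d_model, n_experts))  # one router per layer

def experts_needed(token_vec):
    """(layer, expert) pairs that one token's forward pass touches."""
    needed = set()
    for layer in range(n_layers):
        logits = token_vec @ routers[layer]
        for e in np.argsort(logits)[-top_k:]:
            needed.add((layer, int(e)))
    return needed

token = rng.normal(size=d_model)
needed = experts_needed(token)
print(len(needed))  # n_layers * top_k distinct weight blocks for ONE token
```

So "staying on topic" at best stabilizes choices within each layer; it doesn't reduce the number of distinct weight blocks that must be paged in per token.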
reply
Eh. I mean, 4 tokens a second works fine if you're patient. Go do something else while you wait.

I feel like whenever I'm trying to find information on which local models will work on my hardware, I have to overestimate because people don't know how to wait for things.

Also, reading data doesn't cause SSD wear.

reply
If you want decent throughput and do not care about burning SSD write cycles on a box that was never meant to act like a tiny inference server, a used server with actual RAM is still the cheaper and less silly option. I wouldn't expect Apple's warranty team to be much help.
reply
Is it doing a bunch of ssd writes?
reply
stream from the SSD, perform the calculation, discard, repeat
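That loop is essentially a read-only memory map over the weights file. A minimal sketch (hypothetical file layout, stand-in data): pages are faulted in from the SSD on demand, used for the matmul, and can be evicted freely, with no writes issued to the drive.

```python
import os
import tempfile
import numpy as np

# Stand-in for an on-disk expert weight file: 4 expert matrices back to back.
n_experts, rows, cols = 4, 32, 32
path = os.path.join(tempfile.mkdtemp(), "experts.bin")
np.random.default_rng(2).normal(
    size=(n_experts, rows, cols)
).astype(np.float32).tofile(path)

# Read-only memory map: the OS streams pages from the SSD as they're touched.
weights = np.memmap(path, dtype=np.float32, mode="r",
                    shape=(n_experts, rows, cols))

x = np.ones(rows, dtype=np.float32)
for e in range(n_experts):   # stream each expert's weights from the SSD...
    y = x @ weights[e]       # ...perform the calculation...
    del y                    # ...discard, repeat (no writes to the drive)
```

Read disturb aside, the drive only ever services reads in this pattern, which is why the wear question in the thread hinges on writes that simply aren't happening here.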
reply