For conversational purposes that may be too slow, but as a coding assistant this should work, especially if many tasks are batched so that they all progress simultaneously through a single pass over the model weights on the SSD.
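A toy sketch of that single-pass batching idea (all names and the "layer" arithmetic here are made up for illustration; real layers are matmuls over multi-GB weight tensors, and the load is the expensive part):

```python
# Instead of streaming the whole model from SSD once per task, batch the
# tasks and stream each layer once per decoding step, applying it to every
# task's state. The SSD read cost is amortized over the batch.

def load_layer_from_ssd(layer_id):
    # Stand-in for reading a multi-GB weight shard off the SSD.
    return layer_id + 1  # toy "weights"

def apply_layer(weights, state):
    return state + weights  # toy stand-in for the actual compute

def batched_decode_step(states, num_layers):
    # One sweep over the SSD-resident weights serves the whole batch:
    # each layer is loaded once and used len(states) times.
    for layer_id in range(num_layers):
        weights = load_layer_from_ssd(layer_id)             # read once...
        states = [apply_layer(weights, s) for s in states]  # ...use N times
    return states

# Three "tasks" progress simultaneously through one pass over the weights.
print(batched_decode_step([0, 10, 100], num_layers=4))  # [10, 20, 110]
```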
reply
Three hour coffee break while the LLM prepares scaffolding for the project.
reply
Like computing used to be. When I first compiled a Linux kernel it ran overnight on a Pentium-S. I had little idea what I was doing and probably compiled all the modules by mistake.
reply
I remember that time, when compiling Linux kernels was measured in hours. Then multi-core computing arrived, and within a few years it was down to ten minutes.

With LLMs it feels more like the old punchcards, though.

reply
At least the compiler was free
reply
The point of doing local inference with huge models stored on an SSD is to do it for free, even if slowly.
reply
You are just trading opex for capex. Local GPUs aren't free.
reply
True, but this is not only a trade-off between opex and capex.

Local inference using open-weight models provides guaranteed performance that remains stable over time and is available at any moment.

As many current HN threads show, depending on external AI inference providers is extremely risky: their performance can be degraded or their prices raised at any time, equally unpredictably.

reply
Rather, imagine you have 2-3 of these working 24/7 on top of what you're doing today. What does your backlog look like a month from now?
reply
Batching many disparate tasks together is good for compute efficiency, but makes it harder to keep the full KV-cache for each task in RAM. In a pinch you could dump some of that KV-cache to storage (this is how prompt caching works too, AIUI) and stream it back in as needed, but that adds far more overhead than offloading sparsely used experts, since the KV-cache is accessed much more heavily.
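Back-of-envelope arithmetic makes the RAM pressure concrete. The config numbers below are illustrative assumptions (not any specific model's), but the sizing formula itself is the standard one: two tensors (K and V) per layer, per KV head, per cached token:

```python
# Estimate KV-cache size to show why batching many tasks strains RAM
# even when the expert weights themselves stay on SSD.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   bytes_per_elem=2):  # 2 bytes for fp16/bf16
    # 2x for the K tensor and the V tensor at every layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

# Hypothetical large-MoE-like config: 60 layers, 8 KV heads of dim 128,
# a 32k-token context, fp16 cache.
per_task = kv_cache_bytes(n_layers=60, n_kv_heads=8, head_dim=128,
                          context_len=32_768)
batch = 16
print(f"per task:   {per_task / 2**30:.1f} GiB")          # 7.5 GiB
print(f"batch of {batch}: {batch * per_task / 2**30:.1f} GiB")  # 120.0 GiB
```

So a single long-context task already costs several GiB of cache, and a modest batch multiplies that linearly, which is why spilling KV-cache to storage only works if those loads can be overlapped.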
reply