Meanwhile we're even seeing emerging 'engram' and 'inner-layer embedding parameters' techniques where SSD offload is planned for up front, as part of the architecture design itself.
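Today that kind of offload usually has to be bolted on after the fact rather than designed in. As a rough sketch of what disk offload looks like with an off-the-shelf stack (Hugging Face transformers plus accelerate; the model id and offload folder are placeholders, and this is not the architecture-level approach described above):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder id; any open-weight causal LM works the same way.
model_id = "some-org/some-open-weight-model"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# device_map="auto" lets accelerate split the weights across GPU, CPU RAM,
# and finally disk; offload_folder is where the overflow shards land,
# so point it at a fast SSD.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    offload_folder="./offload",
)

inputs = tokenizer("Hello", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```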
With LLMs it feels more like the old punchcards, though.
Local inference using open-weight models gives you predictable performance that stays stable over time and is available whenever you need it.
As many current HN threads show, depending on external AI inference providers is extremely risky: their performance can degrade without warning, and their prices can be raised just as unpredictably.
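Getting that local setup running is pretty low-effort these days. A minimal sketch using llama-cpp-python with a GGUF checkpoint (the model path and prompt are just placeholders):

```python
from llama_cpp import Llama

# Path to any open-weight model converted to GGUF (placeholder).
llm = Llama(
    model_path="./models/some-open-weight-model.gguf",
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to GPU if one is available
)

out = llm(
    "Q: Why run inference locally? A:",
    max_tokens=128,
    stop=["Q:"],
)
print(out["choices"][0]["text"])
```

Once the weights are sitting on your own disk, nothing upstream can quietly degrade or re-price them.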