There are two parts to this, too. One is the raw model capability, and the other is how well the harness guides the model and meets its expectations. I really think that for stuff like agentic coding, the two have to be treated as a package. This is my favorite example of how much difference a harness can make, even for a tiny model:
https://github.com/itigges22/ATLAS

And you're bang on with the storage comparison: we're basically in the mainframe era of this tech, but there's no reason to think it won't get optimized to the point where you can run the equivalent of current frontier models locally.