For this reason I wonder if local models are a potential business opportunity: offer engineering teams a pre-built, pre-configured GPU rig they can run in a closet. No need to worry about all the things you mentioned, and clients can rest assured their data isn't disappearing into a sketchy data center. There might be regulatory reasons that make on-prem setups appealing as well.
If this mattered to them, they wouldn't be running so much in the cloud or in proprietary software that they have no ability to air-gap.
If companies ever cared about this, Windows would not be dominant on the desktop.
If open models can ever hold roughly 600k-token windows, I'll be really excited. I've found that having Claude read 300k to 400k tokens of your codebase results in better outputs. I also have Claude read the official docs instead of "guessing" how to do something.
I think deepseek v4 pro has 1M context and does pretty well up to around 600k. But if you have the hardware to run that locally, you already know.
Even then, if there's a smaller model with 1M context, you'll need a ton of RAM to actually run it at the full 1M. I guess that's why you don't see it much. Anyone who could run Qwen 3.6 27B with 1M context would be better off running a much bigger model with a smaller context in the same amount of VRAM.
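To put rough numbers on the RAM point, here is a back-of-the-envelope sketch of KV cache size for plain grouped-query attention. The layer count, KV head count, and head dimension below are made-up placeholders, not the real config of any model named in this thread, and models with sliding-window, linear, or sparse attention layers will need far less than this estimate.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem=2.0):
    """Approximate KV cache size for standard (grouped-query) attention."""
    # Factor of 2: each layer stores one K tensor and one V tensor.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem


def gib(n_bytes):
    """Convert bytes to GiB."""
    return n_bytes / 2**30


# Hypothetical model shape, chosen only for illustration.
HYPOTHETICAL_SHAPE = dict(n_layers=64, n_kv_heads=8, head_dim=128)

F16_BYTES = 2.0
# llama.cpp-style q8_0 packs 32 int8 values plus one fp16 scale per block,
# so it costs 34 bytes per 32 elements.
Q8_0_BYTES = 34 / 32

for ctx in (128_000, 1_000_000):
    f16 = kv_cache_bytes(context_len=ctx, bytes_per_elem=F16_BYTES, **HYPOTHETICAL_SHAPE)
    q8 = kv_cache_bytes(context_len=ctx, bytes_per_elem=Q8_0_BYTES, **HYPOTHETICAL_SHAPE)
    print(f"{ctx:>9} tokens: f16 ~{gib(f16):.1f} GiB, q8_0 ~{gib(q8):.1f} GiB")
```

With that placeholder shape the cache alone costs 256 KiB per token at f16, so a full 1M context is roughly 244 GiB before you load any weights, which is the trade-off being described: the same memory would hold a much larger model at a shorter context.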
In terms of optimizing further, huge context + KV quantization sounds like a terrible idea, but there's some decent innovation in sparse attention, KV cache rotation that lets Q8 perform nearly as well as full 16-bit precision, and some ideas around offloading the KV cache to system RAM (though I'm skeptical).
I think the way these models work rules out sane behavior as the context grows: each token introduces potential ambiguity between "USER" and "SYSTEM" messages, which leads to all the catastrophic behaviors.
Anyway, with AMD395+ I'm finding ~100k is the sweet spot for both speed and context usefulness unless it's scoped tightly. With opencode, I manage it with dynamic context pruning: https://github.com/Opencode-DCP/opencode-dynamic-context-pru... ; then anything I touch ends up being refactored so the context doesn't get bloated with unnecessary functions, etc.
Obviously, this isn't compatible with certain business codebases, so I can see why bloat meets bloat.
OLLAMA_FLASH_ATTENTION=1
OLLAMA_KV_CACHE_TYPE=q8_0
OLLAMA_CONTEXT_LENGTH=180000
and that fits in 23GB. [edited for format]
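For anyone who wants to try those settings without touching their shell profile, a minimal sketch of launching the server with them from Python. It assumes the `ollama` binary is on PATH and a recent enough Ollama build to honor all three variables; `build_env` and `launch_server` are just illustrative helper names. As far as I know, the quantized KV cache type only takes effect with flash attention enabled, which is why the two are set together.

```python
import os
import subprocess

# The three settings quoted above, as strings for the process environment.
OLLAMA_SETTINGS = {
    "OLLAMA_FLASH_ATTENTION": "1",
    "OLLAMA_KV_CACHE_TYPE": "q8_0",
    "OLLAMA_CONTEXT_LENGTH": "180000",
}


def build_env(base=None):
    """Return a copy of the base environment with the Ollama settings applied."""
    env = dict(os.environ if base is None else base)
    env.update(OLLAMA_SETTINGS)
    return env


def launch_server():
    """Start `ollama serve` with the settings above (blocks until it exits)."""
    # Assumption: `ollama` is installed and on PATH.
    subprocess.run(["ollama", "serve"], env=build_env(), check=True)
```

Call `launch_server()` to start it. Whether it actually fits in 23GB will depend on the model and quantization you load, so treat that figure as specific to the poster's setup.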