> Try running the latest OS models on a normal Mac or PC.

It can be done through the magic of SSD offload. The worst case involves seconds-per-token speeds, but that's OK if you only care about low volumes of slow unattended inference, which maximizes utilization for the hardware.

(The real worst case, where you're streaming the whole model from the cheapest storage you could feasibly think of, involves multiple minutes per token for a single inference, or even hours per token batch if you're doing many inferences in bulk. That's a lot less helpful, so there's a space for smaller models at the edge, even for unattended workloads.)
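As a rough back-of-envelope sketch of why offload speed swings from seconds to minutes per token: if inference is bound by storage bandwidth, time per token is roughly the bytes of weights read per token divided by the read speed. The sizes and bandwidths below are made-up illustrative numbers, not measurements of any particular model or drive, and the estimate ignores caching, compute time, and batching.

```python
GB = 1e9  # bytes, decimal gigabytes


def seconds_per_token(bytes_read_per_token: float, storage_bytes_per_sec: float) -> float:
    """Bandwidth-bound estimate: time to stream the weights needed for one token."""
    if storage_bytes_per_sec <= 0:
        raise ValueError("storage bandwidth must be positive")
    return bytes_read_per_token / storage_bytes_per_sec


# Hypothetical case 1: only 20 GB of weights are touched per token
# (e.g. the active part of a sparse model), read from a 5 GB/s NVMe SSD.
fast_case = seconds_per_token(20 * GB, 5 * GB)

# Hypothetical case 2: a whole 200 GB model is streamed per token
# from slow storage at 0.15 GB/s.
slow_case = seconds_per_token(200 * GB, 0.15 * GB)

print(f"fast storage: {fast_case:.1f} s/token")
print(f"slow storage: {slow_case / 60:.1f} min/token")
```

Under these assumed numbers the first case lands at a few seconds per token and the second at roughly 22 minutes per token, which is the gap described above.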

by nerdsniper 8 hours ago

> I disagree. It is not the model alone. It needs a system which capitalizes on it. And this is very complex.

AFAICT … despite saying you “disagree”, you appear to be agreeing with the parent comment that the model is less important and compute (all that complex infra) and data (also complex infra) are more important.

by trollbridge 5 hours ago

An LLM which provides an OpenAI or Anthropic API-compatible interface + a coding harness like OpenCode or oh-my-pi is a pretty easy "ecosystem" to replicate. Exactly what makes you say Fable or Mythos are "systems, not just pure models"?

by everforward 5 hours ago

Fable can delegate tasks to Opus or Sonnet, so it has some agentic properties and I believe it does them in parallel.

The parallelism is where this starts to fall apart on a local PC. Like I can run some Qwen quants, but I can’t run a decent Qwen model while also running another model smart enough to actually implement it. I’d have to do them in series, and given how long Fable seems to take even with parallelism, I’d probably be waiting days for an answer.

by trollbridge 5 hours ago

oh-my-pi can delegate tasks to other models too. I usually use DS4 Flash for low priority subagent tasks.

If Fable is "delegating" tasks, then there's actually an agent front end of whatever you think the API is.

We have a local instance of Qwen-3.6 which is more than adequate for running agents. You can mix and match local and cloud-hosted models. (My biggest use case for local models right now is vision models because they're quite small and I can avoid some data-locality issues my customers wouldn't be comfortable with if I sent them to a Chinese model.)

by everforward 1 hour ago

> If Fable is "delegating" tasks, then there's actually an agent front end of whatever you think the API is.

I would say behind (I believe you use the API just like you do Opus), but yeah. I'm not claiming it's a property of the LLM itself, I also presume this is some variety of tool calling agent harness.

> We have a local instance of Qwen-3.6 which is more than adequate for running agents. You can mix and match local and cloud-hosted models.

I'm presuming OP meant local as in the models run locally as well. I do know you can do subagents in Pi (probably others too), but the vast majority of people are going to hit hardware limitations trying to run them in parallel on local hardware.

I'm doubtful Fable's harness is unique in some way that you can't replicate with Pi. I'm mostly doubtful there are more than a handful of people with hardware sitting in their house that can execute more than one meaningfully smart model at a time.

If you're on local hardware, Deepseek v4 Flash is in the ballpark of 180GB of VRAM alone. Even on smaller models, Qwen + a dumber agent to execute is probably in the realm of 60GB of VRAM.

I do suspect you could get Deepseek to do Fable level things with a good harness (or a bunch of models really, I'm fairly convinced the magic of Fable is in the harness rather than the model).

by ramblurr 8 hours ago

> > The bottleneck is compute and data, not the model.

> I disagree. It is not the model alone. It needs a system which capitalizes on it. And this is very complex. Hardware, software, architecture - it takes a lot to get it right.

What do you disagree with exactly?

by christkv 8 hours ago

For now, however, I suspect that the gigantic models are not needed and that you will be able to do pretty much what you need in a specific domain with 120b or lower. There is so much trash in the frontier models. I don't need all the world's slam poetry for my coding tasks, for example.

by ACCount37 7 hours ago

Wrong, mostly.

Model capability is a function of model size. Raising the bar raises model performance in every domain.

An "idiot savant" model that's overtrained for a specific domain would beat a generalist model of the same size. But scale the generalist up enough, and it'll trounce the specialist. Removing poetry data from a model's training mix doesn't give you much - it might even cost you some performance - and the "idiot savant" approach of overtraining for a domain has a hard ceiling.

So far, it seems like there's some equivalent of "g factor" in LLMs - a broad "intelligence" value that performance across many diverse domains correlates with. And, as a rule, larger models have more of it.

by everforward 5 hours ago

While I disagree with OP about removing stuff from the model, there’s a valid question about tradeoffs between intelligence and price.

Deepseek Flash is almost certainly wrong more often than Opus or Fable. It also costs like 5% as much.

The question becomes: if I run Deepseek in a loop to fix the mistakes it made that Opus/Fable didn’t, can it fix its own bugs in few enough tokens that it’s still cheaper?

So far, the answer seems to be “yes, by a significant margin”. A lot of tasks are simple enough that both Deepseek and Opus or Sonnet can one-shot it, which is a huge cost win for Deepseek. Even on the long tail, it’s usually like 4x the tokens on Deepseek which is still way cheaper than Opus.
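To make that arithmetic concrete, here is a minimal sketch using the rough figures from this comment (about 5% of the price per token, about 4x the tokens on the long tail). The function names are just for illustration, and it assumes cost scales linearly with tokens, which ignores things like input/output price differences and my own time spent waiting.

```python
def relative_cost(price_ratio: float, token_multiplier: float) -> float:
    """Cost of the cheap model relative to the expensive one for the same task.

    price_ratio: cheap price per token / expensive price per token.
    token_multiplier: how many times more tokens the cheap model burns.
    """
    return price_ratio * token_multiplier


def break_even_multiplier(price_ratio: float) -> float:
    """Token multiplier at which the cheap model stops being cheaper."""
    if price_ratio <= 0:
        raise ValueError("price_ratio must be positive")
    return 1.0 / price_ratio


# Rough figures from the comment: ~5% of the price, ~4x the tokens.
long_tail = relative_cost(0.05, 4)
limit = break_even_multiplier(0.05)

print(f"long-tail task costs about {long_tail:.0%} of the expensive model")
print(f"break-even at about {limit:.0f}x the tokens")
```

With those inputs the long-tail case comes out at about 20% of the expensive model's cost, and the cheap model only loses once it needs around 20x the tokens.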

There are things that Opus can do that Deepseek just won’t ever really nail, but it happens so infrequently that I just don’t worry. Like most people, most of what I do is the same sort of “3 tier app with a React frontend” that doesn’t take a rocket scientist to work out.

by overfeed 6 hours ago

> Wrong, mostly.

> Model capability is a function of model size

Model effectiveness has improved across model sizes. You really should try the latest flash variants more. They have become my default for most tasks except for gnarly high-level planning.

by trollbridge 5 hours ago

Right - the idea that "bigger model = better" might have been true a year ago, but the flash models are extremely effective right now. You simply use them for the tasks they are ideally suited for.

by ACCount37 6 hours ago

"Capability per parameter" is rising, but parameter count remains an advantage. And small models remain bad, because "good" is a rapidly moving target.

A 2026 4B beats 2024 4B, but both are far behind the contemporary frontier. Which makes them bad. There is no such thing as "too much capability" - a "good" model is whatever the current frontier is.

In 2024, a "good" model is one that can be trusted to write an 800-line script. In 2026, it's a model that can be trusted to do gnarly high-level planning and execution both. In 2028, it's going to be something like a model you can point at an extremely involved task, abandon, and have it report back with a "done" in 3 weeks.

by overfeed 4 hours ago

> A 2026 4B beats 2024 4B, but both are far behind the contemporary frontier.

The thing about engineering is you don't just use the biggest bolt on the market on every bridge.

> In 2024, a "good" model is one that can be trusted to write an 800-line script. In 2026, it's a model that can be trusted to do gnarly high-level planning and execution both

This sounds a lot like having a single diamond-head hammer as the only tool in your toolbox. As suggested by the name, flash models are fast - sometimes I want to write the equivalent of fifty 800-line scripts. There is such a thing as good enough.

by ACCount37 4 hours ago

Good enough? That's a lie people tell each other because they lack imagination.

"It's good enough" was said about GPT-4, o1, o3, Opus 4 and more. Guess what happened? Newer models released, people updated their expectations of what LLMs can do, usage got more aggressive, and somehow, GPT-4 went from "good enough" to "obsolete trash".

If you have no imagination, then at least substitute your pattern recognition for it.

The world is hungry for capabilities. There are piles upon piles of tasks that aren't done by LLMs simply because LLMs aren't good enough to do them.

The thing a frontier model gives you is "you don't have to babysit a model to get it to do X", and that X gets more and more impressive release to release.

by overfeed 4 hours ago

I wish you had addressed at least one of my arguments in good faith before jumping to insults and countering a strawman argument I didn't make - I never claimed there will be no use for more capable models.

You do your AI-maximalism, and I'll stick to making trade-offs based on the needs of each piece of work.

by ACCount37 3 hours ago

I.e. spending your time and effort on making choices that don't matter.

I'll do more "per-task model selection" when AIs themselves get good at it.