upvote
Not all tasks require models like Opus. If they don't, it's more efficient to use cheaper and faster models. For most of my tasks now I use the big Kimi/Qwen/GLM models because they're cheap and good enough, if not even smaller local ones.

I would say that open-source models are now good enough to serve a significant part of the current market.

reply
The OpenRouter usage stats indicate the opposite: https://openrouter.ai/rankings?view=month
reply
OpenRouter usage is likely skewed towards LLMs that are more niche and/or self-hostable on solid hardware that most consumers don't have on hand. I can imagine Anthropic and OpenAI LLMs often get called directly through their own APIs instead.

At least in my experience and that of my friends, we use OpenRouter for cases where we want smaller LLMs like Qwen, but when I've used ChatGPT and Claude, I use those APIs directly.

reply
I use ChatGPT and Claude on OpenRouter, because it's just easier than buying credits on each platform separately.
reply
Same, and my little SaaS is pushing more than 0.1% of the TOTAL volume of tokens on OpenRouter, so the reality is they’re TINY.
reply
What happened around Jan this year ('26) that caused such a climb in usage?
reply
Openclaw
reply
No. Right now I'm upset that Google has removed (or at least is in the process of removing) the Gemini 2.0 flash model. We use it for some pretty basic functionality because it's cheap and fast and honestly good enough for what we use it for in that part of our app. We're being forced to "upgrade" to models that are at least 2.5 times as expensive, are slower and, while I'm sure they're better for complex tasks, don't do measurably better than 2.0 flash for what we need. Yay. We've stuck with the GCP/Gemini ecosystem up until now, but this is kind of forcing us to consider other LLM providers.
reply
this is one of the reasons I'm hearing more and more people are using open/locally hosted models: so we don't have to waste time entirely redoing everything when a company inevitably pulls the rug out from under us and changes or removes something integral to our flow, which we've seen countless times over the years and which seems to be getting more and more common.

products entirely disappearing or significantly changing will only become more common in the LLM arena as companies shut down, bubbles deflate, brand priorities drastically shift, etc.

i think we're at, or at least close to, a time to really put some thought into which pieces of our flow could be done entirely with an open/local model, and to be honest with ourselves about which pieces truly need SOTA or closed models that may entirely disappear or change. in the long run, putting a little thought into this now will save a lot of headache later.

reply
Yeah. Back when Gemma2 came out we benchmarked it and were looking at open models. For our use case though, while the tasks are pretty simple, we do need a pretty large context window and Gemini had a big lead there over the open models for quite a while. I'll probably be evaluating the current batch of open models in the near future though.
reply
What’s interesting about this is that for previous technologies you could define a standard and demonstrate compliance with interfaces and behavior.

But with LLMs, how do you know switching from one to another won’t change some behavior your system was implicitly relying on?

reply
In case you don't know, Gemini 2.5 flash is hosted on DeepInfra. They also have 1.5 flash but not 2.0 flash.

I have no affiliation with DeepInfra. I use them, because they host open-source models that are good.

reply
Thanks. Yeah, for now we're moving to 3.1 Flash Lite as that's the new cheapest at $0.25/1M and is also still "good enough". 2.5 Flash is more expensive at $0.30/1M (it looks like DeepInfra charges the same as GCP/Vertex AI for it). I might check them out for Gemma though. We benchmarked Gemma2 when it came out and it wasn't remotely usable for us, largely because the context window was way too small. It looks like 3 or 4 might be worth evaluating though.
reply
> There isn't, pretty much everyone wants the best of the best.

For direct user interaction or coding problems, perhaps. But as API calls get cheaper, it becomes more realistic to use them for completely automated workflows against datasets, or as sub-agents called from expensive SOTA models.

For example, in Claude, using Opus as an orchestrator that calls Sonnet sub-agents is a popular usage "hack." That only gets more powerful as the Sonnet-equivalent model gets cheaper. Now you can spawn entire teams of small, specialized sub-agents with small context windows and limited scope.
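
Sketched roughly (the model names, scope labels, and token threshold below are made up for illustration, not anything the providers document), the routing idea looks like:

```python
# Hypothetical sketch of the orchestrator/sub-agent pattern: route each
# subtask to a cheap model when its scope is narrow, and keep open-ended
# work on the frontier model. Names and limits are illustrative only.

CHEAP_MODEL = "small-model"        # stand-in for a Sonnet-class/open model
FRONTIER_MODEL = "frontier-model"  # stand-in for an Opus-class model

def route(task: dict) -> str:
    """Pick a model for a subtask based on its declared scope and size."""
    if task.get("scope") == "narrow" and task.get("context_tokens", 0) < 8_000:
        return CHEAP_MODEL
    return FRONTIER_MODEL

subtasks = [
    {"name": "summarize file", "scope": "narrow", "context_tokens": 2_000},
    {"name": "plan refactor", "scope": "open-ended", "context_tokens": 40_000},
]
assignments = {t["name"]: route(t) for t in subtasks}
```

The other win besides cost is that each sub-agent starts with its own fresh, small context window, so the orchestrator's window isn't consumed by subtask detail.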

reply
Exactly.

I created my own MCP server with custom agents that combine several tools into a single one. For example, WebSearch, WebFetch, and Context7 are exposed as a single "web research" tool, backed by the cheapest model that passes evaluation. The same goes for codebase research.

Using it with both Claude and Opencode saves a lot of time and tokens.

reply
I'd be interested in seeing the source for this if you have a moment
reply
> But as API calls get cheaper, it becomes more realistic to use them for completely automated workflows against data-sets

Seems like a huge waste of money and electricity for processes that can be implemented as a traditional deterministic program. One would hope that tools would identify recurrent jobs that can be turned into simple scripts.

reply
It depends on the specific task.

For example: "Here's our dataset containing customer feedback comment fields; look through them, draw out themes and associations, and look for trends." Solving that with a deterministic program isn't a trivial problem, and it's likely cheaper to solve via LLM.
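
As a rough sketch of the shape of that job (`call_llm` is a stub here; in practice it would hit a cheap model's API, and the batch size is an arbitrary choice):

```python
# Hypothetical sketch: batch feedback comments into prompts and ask a
# cheap model for recurring themes. call_llm is a stub, not a real API.

def call_llm(prompt: str) -> str:
    # stand-in for an API call to a cheap model
    return "themes: pricing, onboarding"

def extract_themes(comments: list[str], batch_size: int = 50) -> list[str]:
    """Send comments to the model in batches; collect its theme lists."""
    outputs = []
    for i in range(0, len(comments), batch_size):
        batch = comments[i:i + batch_size]
        prompt = ("List the recurring themes in these comments:\n"
                  + "\n".join(batch))
        outputs.append(call_llm(prompt))
    return outputs

themes = extract_themes(["Support was slow", "Love the new dashboard"])
```

The per-token bill scales with dataset size, which is exactly why cheaper models make this kind of workflow viable at all.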

reply
That is a very complex, high level use case that takes time to configure and orchestrate.

There are many simpler tasks that would work fine with a simpler, local model.

reply
Ever hit your daily limit on Claude Code and saw how expensive it is to pay per token?
reply
maybe there isn't, but as understanding grows, people will realize that having an orchestration agent delegate simple work to lesser agents is significant not only for cost savings but also for preserving context-window space.
reply
For coding I want the best. But both I and $work do lots of things besides coding where smaller models like qwen3.5-27b work great, at much lower cost.
reply
That isn't true. In a Codex or Claude Code instance, sure... but those are not the main users of APIs. If you are using LLMs in a service for customers, costs matter.
reply
The market for API tokens is bigger than people like you and me (who also want the best) using them for code.

There are a lot of data science problems that benefit from running the dataset through an LLM, which becomes bottlenecked on per-token costs. For these you take a sample subset and run it against multiple providers and then do a cost versus accuracy tradeoff.

The market for API tokens is not just people using OpenCode and similar tools.
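
That tradeoff can be as simple as the sketch below (every accuracy and price here is invented for illustration, not a real quote):

```python
# Sketch of the cost-vs-accuracy tradeoff: run a sample subset through
# several providers, then pick the cheapest model that clears your
# accuracy bar. All numbers are made up.

results = {
    # model: (accuracy on the sample subset, $ per 1M tokens)
    "frontier-large": (0.95, 15.00),
    "mid-tier":       (0.92, 3.00),
    "small-open":     (0.90, 0.40),
}

def pick_model(results: dict, min_accuracy: float) -> str:
    """Cheapest model whose sampled accuracy meets the threshold."""
    eligible = {m: cost for m, (acc, cost) in results.items()
                if acc >= min_accuracy}
    return min(eligible, key=eligible.get)

choice = pick_model(results, min_accuracy=0.91)  # -> "mid-tier"
```

With made-up numbers like these, a two-point accuracy drop buys a 5x cost reduction, which is the whole argument for not defaulting to the frontier model on large datasets.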

reply
Nope. I get very good results from GLM 5 and 5.1. I’m not working on anything so complex and groundbreaking that I need the best.

Coding is a rung on the ladder of model capability. Frontier models will grow to take on more capabilities, while smaller, more focused models become the economical choice for coding.

reply
Everyone may want the best, but the amount of AI-addressable work outstrips the budget available for buying the best by quite a wide margin.
reply
OpenCode allows for free inference, though.
reply
Not really. It depends on the use case. For private stuff I'm very happy to take what was SOTA a year or two ago if I can have it all running in my home and don't have to share any of my data with some sleazy big-tech cloud.

The price is a concern too of course. But privacy is a bigger one for me. I absolutely don't trust any of their promises not to use data for training purposes.

reply
That's only because current models don't saturate people's needs. Once they are fast and smart enough people will pick cheaper ones.
reply