undefined

points

[-]

That's correct. Input caching helps, but even then at e.g. 800k tokens with all of them cached, the API price is $0.50 * 0.8 = $0.40 per request, which adds up really fast. A "request" can be e.g. a single tool call response, so you can easily end up making many $0.40 requests per minute.

by acjohnson5517 hours ago|

parent|

[-]

Interesting, so a prompt that causes a couple dozen tool calls will end up costing in the tens of dollars?

by dathery4 hours ago|

parent|

[-]

It essentially depends on how many back-and-forth calls are required. If the model returns a request for multiple calls at once, then the reply can contain all responses and you only pay once.

If the model requests tool calls one-by-one (e.g. because it needs to see the response from the previous call before deciding on the next) then you have to pay for each back-and-forth.

If you look at popular coding harnesses, they all use careful prompting to try to encourage models to do the former as much as possible. For example opencode shouts "USING THE BATCH TOOL WILL MAKE THE USER HAPPY" [1] and even tells the model it did a good job when it uses it [2].

[1] https://github.com/anomalyco/opencode/blob/66e8c57ed1077814c... [2] https://github.com/anomalyco/opencode/blob/66e8c57ed1077814c...

by isbvhodnvemrwvn12 hours ago|

parent|

prev|

[-]

Not necessarily, take a look at ex OpenApi Responses resource, you can get multiple tool calls in one response and of course reply with multiple results.

by jasondclinton18 hours ago|

prev|

[-]

If you use context cacheing, it saves quite a lot on the costs/budgets. You can cache 900k tokens if you want.