Coding, for some future definition of "small model" that expands to include today's frontier models. Here's what I commented a few days ago on the codex-spark release:

"""

We're going to see a further bifurcation in inference use-cases in the next 12 months. I'm expecting this distinction to become prominent:

(A) Massively parallel (optimize for token/$)

(B) Serial low-latency (optimize for token/s)

Users will switch between A and B depending on need.

An example of (A):

- "Use subagents to search this 1M line codebase for DRY violations subject to $spec."

Examples of (B):

- "Diagnose this one specific bug."

- "Apply these text edits."

(B) is used in funnels to unblock (A).

"""

reply
You could build realtime API routing and orchestration systems that rely on high-quality language understanding but need near-instant responses. Examples:

1. Intent-based API gateways: convert natural-language queries into structured API calls in real time (e.g., "cancel my last order and refund it to the original payment method" -> authentication, order lookup, cancellation, refund API chain); a rough sketch of this follows below.

2. Of course, realtime voice chat... kinda like you see in movies.

3. Security and fraud triage systems: parse logs without hardcoded regexes and issue alerts and full user reports in real time and decide which automated workflows to trigger.

4. Highly interactive what-if scenarios powered by natural language queries.

This effectively gives you database-level speeds on top of natural language understanding.
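Rough sketch of the intent-gateway idea (item 1). Everything here is illustrative: `generate` stands for whatever call sends a prompt to your model and returns text, and the four handlers are stand-ins for real endpoints.

    import json

    # `generate` is any callable that sends a prompt to your model and
    # returns its text output (assumption; swap in your client of choice).

    SCHEMA_PROMPT = (
        "Convert the user request into a JSON list of API calls.\n"
        "Allowed calls: authenticate, lookup_order, cancel_order, refund_order.\n"
        "Respond with JSON only, no prose.\n"
        "User request: "
    )

    # Stand-ins for real endpoints.
    HANDLERS = {
        "authenticate": lambda args: print("auth ok"),
        "lookup_order": lambda args: print("lookup", args),
        "cancel_order": lambda args: print("cancel", args),
        "refund_order": lambda args: print("refund", args),
    }

    def route(generate, request):
        plan = json.loads(generate(SCHEMA_PROMPT + request))
        for step in plan:  # production code would validate against a schema first
            handler = HANDLERS.get(step["call"])
            if handler is None:
                raise ValueError("model proposed unknown call: " + step["call"])
            handler(step.get("args", {}))

The low-latency requirement is the whole point: the round trip through the model has to be comparable to the API calls themselves, or the gateway becomes the bottleneck.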

reply
Routing in agent pipelines is another use. "Does user prompt A make sense with document type A?" If yes, continue; if no, escalate. That sort of thing.
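A minimal version of that gate, assuming the same kind of generic `generate` callable: force a one-word yes/no answer and branch on it.

    def prompt_matches_doc(generate, prompt, doc_type):
        # Ask for a strict yes/no so parsing stays trivial.
        question = (
            "Document type: " + doc_type + "\n"
            "User prompt: " + prompt + "\n"
            "Does this prompt make sense for this document type? Answer yes or no."
        )
        return generate(question).strip().lower().startswith("yes")

    # if not prompt_matches_doc(generate, user_prompt, "invoice"):
    #     escalate(user_prompt)  # hypothetical escalation hook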
reply
For this type of repetitive application, I think it's common to fine-tune a model on your business problem to reach higher quality/reliability metrics. That might not be possible with this chip.
reply
They say LoRA finetunes work.
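For reference, a LoRA finetune setup is tiny in code terms; this is the standard Hugging Face peft pattern (the base-model name is a placeholder, and the hyperparameters are just typical defaults):

    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained("your-base-model")  # placeholder
    lora = LoraConfig(
        r=8,                                  # adapter rank
        lora_alpha=16,
        target_modules=["q_proj", "v_proj"],  # attention projections
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, lora)
    model.print_trainable_parameters()  # only the adapter weights train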
reply
I'm wondering how much the output quality of a small model could be boosted by taking multiple goes at it. Generate 20 answers and feed them back through with a "rank these responses" prompt. Or do something like MCTS.
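A sketch of the generate-then-rank loop (prompt wording and n are arbitrary; `generate` is your model call):

    def best_of_n(generate, prompt, n=20):
        # Sample n candidates; in practice you'd fire these off in parallel.
        candidates = [generate(prompt) for _ in range(n)]
        numbered = "\n\n".join(
            "[%d] %s" % (i, c) for i, c in enumerate(candidates)
        )
        rank_prompt = (
            "Question: " + prompt + "\n\n"
            "Candidate answers:\n" + numbered + "\n\n"
            "Reply with only the number of the best answer."
        )
        choice = generate(rank_prompt).strip()
        try:
            return candidates[int(choice)]
        except (ValueError, IndexError):
            return candidates[0]  # fall back if the ranker misbehaves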
reply
Isn't this what thinking models do internally? Chain of thought?
reply
No. Chain of thought is just the model generating a single answer for longer inside <think></think> tags, which are not shown in the final response. The strategy of generating different answers in parallel is something different (it can be used in conjunction with chain of thought) and is what's used by models like Gemini 3 Deep Think and GPT-5.2 Pro.
reply
Hmm... got it. Thanks.
reply
Maybe summarization? I'd still worry about accuracy, but smaller models do quite well.
reply
Language translation, chunk by chunk.
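E.g., something like this (chunk size and prompt are arbitrary; `generate` is your model call):

    def translate_chunks(generate, text, target_lang, max_chars=2000):
        # Split on paragraph boundaries so each chunk stays coherent.
        chunks, current = [], ""
        for para in text.split("\n\n"):
            if current and len(current) + len(para) > max_chars:
                chunks.append(current)
                current = ""
            current += para + "\n\n"
        if current:
            chunks.append(current)
        return [
            generate("Translate the following to " + target_lang + ":\n\n" + c)
            for c in chunks
        ]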
reply