undefined

upvote

points

by antirez9 hours ago |

upvote

by prettyblocks9 hours ago|

[-]

I think the biggest case for fine tuning is probably that you can take small models, fine tune them for applications that require structured output, and then run cheap inference at scale. "Frontier LLMs can do it with enough context" is not really a strong argument against fine-tuning, because they're expensive to run.

reply

upvote

by _the_inflator4 hours ago|

[-]

I agree.

Also for certain use cases there are constraints like embedded hardware systems with no internet access. These LLMs have to be trained to specialize for clearly defined use cases under hardware constraints.

Frontier LLMs also are rarely function in isolation instead are orchestrating a system of special units aka subsystems and agents.

While costs and effort are one thing, being able to downsize these monster LLMs through finetuning itself in the first place is extremly valuable.

reply

upvote

by andriy_koval3 hours ago|

[-]

> "Frontier LLMs can do it with enough context" is not really a strong argument against fine-tuning, because they're expensive to run.

I am not expert in this topic, but I am wondering if large cached context is actually cheap to run and frontier models would be cost efficient too in such setting?

reply

upvote

by faxmeyourcode7 hours ago|

[-]

Especially for super constrained applications. I don't care if the language model that I use for my extremely specific business domain can solve PhD math or remember the works of Shakespeare. I'd trade all of that for pure task specific accuracy.

reply

upvote

by arkmm5 hours ago|

[-]

Can you share more details about your use case? The good applications of fine tuning are usually pretty niche, which tends to make people feel like others might not be interested in hearing the details.

As a result it's really hard to read about real-world use cases online. I think a lot of people would love to hear more details - at least I know I would!

reply

upvote

by derwiki9 hours ago|

[-]

Exactly, inference cost is a very good reason to fine tune with something like Qwen

reply

upvote

by Me10008 hours ago|

[-]

Wouldn’t it be better to use a grammar in the token sampler? Tuning is fine, but doesn’t guarantee a syntactical correct structured output. But if the sampler is grammar aware it could.

reply

upvote

by MillionOClock8 hours ago|

[-]

I think both should be done, they don't really serve the same purpose.

reply

upvote

by butILoveLife8 hours ago|

[-]

This is literally what I'm waiting for. I want a ~8B model that works well with OpenClaw.

reply

upvote

by prettyblocks8 hours ago|

[-]

I don't think you will get that anytime soon because for a model to work well with something like openclaw it needs a massive context window.

reply

upvote

by butILoveLife8 hours ago|

[-]

but but but but unified memory! (jk, I don't actually believe in Apple marketing words)

There might be future optimizations. Like, have your small model do COT to find where to look for memory that is relevant.

reply

upvote

by piyh8 hours ago|

[-]

Qwen 9B doesn't?

reply

upvote

by butILoveLife8 hours ago|

[-]

Nothing is really usable outside Opus.

I've tried too. Wasted a few days trying out even high end paid models.

reply

upvote

by throwaway69779 hours ago|

[-]

I agree- I'm currently trying to learn how I can embed a fine tuned tiny model into my c++ game so it can provide a narrative in prose of certain game-event logs. It needs to be as tiny as possible so it doesn't take resources away from the running game.

reply

upvote

by lelanthran2 hours ago|

[-]

> I agree- I'm currently trying to learn how I can embed a fine tuned tiny model into my c++ game so it can provide a narrative in prose of certain game-event logs.

Unless your game states have combinatoral exlosion, would it not be better to generate all of that pre-build? If templated you can generate a few hundreds of thousands of templates to use for any circumstance, then instantiate and stitch together those templates during the game runtime.

reply

upvote

by yw34105 hours ago|

[-]

How small a model are we talking? Don't even the smallest models which would work need gigabytes of memory?

reply

upvote

by lelanthran3 hours ago|

[-]

> How small a model are we talking? Don't even the smallest models which would work need gigabytes of memory?

I dunno, for game prose I expect that a tiny highly quantized model would be sufficient (generating no more than a paragraph), so 300MB - 500MB maybe? Running on CPU not GPU is feasible too, I think.

reply

upvote

by bravura7 hours ago|

[-]

For me, trying to fine-tune a model to write "best day" prose I would accept over 80% of the time.

You are correct if we are talking about knowledge.

However it is bad at hyper-idiosyncratic, gritty style transfer.

I first noticed the issue when asking claude code to draft email responses. The choice of register was off. ("Register in writing refers to the level of formality and tone chosen to suit a specific audience, purpose, and context.")

I decided to talk all my HN comments and rewrite them in various bad LLM prose, and see if I could use DSPy to optimize a prompt using in-context-learning (ICL, I give it 10 examples of my HN comments) and the results were abysmal. RHLF fine-tuned frontier LLMs have a deep seated aversion to the target stylistic distribution of my comments.

I tried fine-tuning qwen3, llama, and gemma models. Instruct models are already so tuned that they could not be tuned. This is using several hunded comments as gold targets and 5 different LLM degradations per gold as the input.

reply

upvote

by HanClinto4 hours ago|

[-]

How well would you say it worked? I do like the idea of taking my historical forum posts and e-mails and whatnot and training an autocomplete LLM that is specifically "my voice".

reply

upvote

by danielhanchen9 hours ago|

[-]

These are fair points considering LLMs are getting smarter and better every week - but to be fair the biggest benefits of finetuning / RL are still not yet realized:

1. If we have robots at home, they need some sort of efficient continual learning, which could be on the go finetuning / RL via some small LoRA - this will need to do multimodal finetuning with sparse reward signals - one could also imagine all data is aggregated to one central processing center after anonymization, and training a larger model with more data + RL like that

2. Agreed images, audio, video etc is what still LoRA does well - the guide at https://unsloth.ai/docs/models/qwen3.5/fine-tune is actually a vision + text finetuning guide, so you can finetune the vision layers on your own use case

3. Model routing is going to be more the norm in the future - ie locally smallish models with LoRA for continuous finetuning can be used, but complex tasks can be offloaded to a large LLM in the cloud.

4. I also wrote about more use-cases below on the post - DoorDash, Vercel, Mercor, Stripe, NASA, Perplexity, Cursor and many others all do finetuning - for eg Cursor, Perplexity finetune large OSS LLMs themselves for their specific product lines - so there is definitely value if you have the data for it.

reply

upvote

by canyon2898 hours ago|

[-]

I work on Gemma and Gemini models I want to echo Daniel's point here. Small finetuned models have their place even with larger general purpose models.

For example last year with Daniel/Unsloth's help we released a tiny specialized model that can get equivalent to Gemini level purpose specifically for FC. For folks that need efficient limited purpose models small models like this can fit a specific need.

https://blog.google/innovation-and-ai/technology/developers-...

Especially on device. https://developers.googleblog.com/on-device-function-calling...

It's the same with chips, we have general purpose CPUs but we still have specialized silicon for tasks that are smaller, more power efficient, cheaper, and because they're single purpose it simplifies and derisks certain designs.

And I have to add, if you want to learn about finetuning models efficiently the Unsloth guides are at the top of my list. They're practical, have all the technical details, and most importantly Daniel and the others are working around the clock to keep it up to date in what is an incredibly fast moving space of models and hardware. I am continually astounded by their work.

reply

upvote

by danielhanchen7 hours ago|

[-]

Function calling and also finetuning with FC is a big use-case across any companies - we constantly see large orgs have internal APIs with some schema, and JSON guided output is good, but finetuning with FC is just much more powerful since the model actually starts to understand how to utilize the tools more effectively!

Nice work with Gemma and Gemini as usual! :) Excited for more cool models this year!

reply

upvote

by abhgh8 hours ago|

[-]

They are great for specialized use-cases: (a) where the problem is not hard enough (you don't need reasoning), or (b) diverse enough (you don't need a world model), (c) you want cheap inference (and you can make it happen hardware-wise) and (d) you either have enough data or a workflow that accumulates data (with fine tuning with enough data you can sometimes beat a premier model while ensuring low latency - ofc, assuming (a) and (b) apply).

I make it sound like a rare perfect storm needs to exist to justify fine tuning, but these circumstances are not uncommon - to an extent (a), (c) and (d) were already prerequisites for deploying traditional ML systems.

reply

upvote

by joefourier8 hours ago|

[-]

Fine-tuning still makes sense for cost/latency-sensitive applications. Massive context windows drastically slow down generation, and modern models' performance and instruction following ability relies heavily on a reasoning step that can consume orders of magnitude more tokens than the actual response (depending on the application), while a fine-tuned model can skip/significantly reduce that step.

Using the large model to generate synthetic data offline with the techniques you mentioned, then fine-tuning the small model on it, is an underrated technique.

reply

upvote

by sweaterkokuro8 hours ago|

[-]

As strong as current LLMs are they are easily distracted from the task often. At production scale, fine tuning can make a lot more sense given you provide the model a very specific task.

reply

upvote

by andsoitis8 hours ago|

[-]

For agentic coding, which do you prefer:

a) qwen3-coder

b) qwen3.5 (general)

reply

upvote

by ranger_danger9 hours ago|

[-]

where it makes sense IMO is when you need it to know about a large amount of information that's not already in the model, such as a company knowledgebase, code repositories or a trove of specialized legal documents... in that case it's not realistic to try to stuff the context window every time with that information, especially if you're trying to make a responsive chat bot.

reply

upvote

by antirez9 hours ago|

[-]

With the current context windows and the ability those models did RL to work as agents, it's much faster and reliable for them to use tools and find the information before replying. Much better, no hallucinations problems (or a lot less), no fine tuning needed when information changes. I believe it is exactly in this case that fine tuning is no longer useful, and even in the past worked at very different degrees of quality.

reply

upvote

by dotancohen9 hours ago|

[-]

Wouldn't a RAG make more sense for this use case?

reply

upvote

by larodi6 hours ago|

[-]

indeed, and in practical terms, this is more often than never, and particularly with large knowledge bases. also makes super sense for VLMs and ViT models.

reply

upvote

by KronisLV8 hours ago|

[-]

> But now, why?

Because these models are good in general but their Latvian output is half-drivel, like the roots of the words are usually the right ones, but not the rest.

That, and EuroLLM is really slow to release new models that would be similarly good off the shelf.

reply

upvote

by esafak9 hours ago|

[-]

I would like model adaptation algorithms like Doc-to-LoRA (https://pub.sakana.ai/doc-to-lora/) to go mainstream.

reply