But my real concern is with the results. The "13 parameters" figure looks like bait, because it is one result from finetuning a model on a very simple math benchmark, grade-school math (GSM8K), which is already heavily saturated for every model. Besides, it seems to happen only for the Qwen family of models... It looks like GSM8K was part of Qwen's training set, and this TinyLoRA finetuning just made the final adjustments to perfectly reflect that overtraining.
>In particular, learning to generate longer outputs may be possible in few parameters
Reminded me of: https://arxiv.org/abs/2501.19393
>we develop budget forcing to control test-time compute by forcefully terminating the model’s thinking process or lengthening it by appending “Wait” multiple times to the model’s generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps
Maybe the model indeed simply learns to emit the EOS token (or similar) later, and the capability is already present in the base model.
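A toy sketch of that budget-forcing idea: intercept EOS before a minimum "thinking" budget is reached and substitute a "Wait" token so generation continues. Everything here (the stub model, token strings, function names) is hypothetical for illustration; a real implementation would do this inside the sampler's logits processing.

```python
# Toy illustration of budget forcing: if the model tries to emit EOS
# before a minimum token budget, replace EOS with "Wait" and continue.
EOS, WAIT = "<eos>", "Wait"

def stub_model(tokens):
    # Hypothetical stand-in for an LLM: it tries to stop
    # as soon as it has generated 3 tokens.
    return EOS if len(tokens) >= 3 else f"tok{len(tokens)}"

def generate(model, min_tokens=6, max_tokens=10):
    out = []
    while len(out) < max_tokens:
        nxt = model(out)
        if nxt == EOS:
            if len(out) >= min_tokens:
                break  # budget satisfied, allow the model to stop
            nxt = WAIT  # force continued "thinking"
        out.append(nxt)
    return out

print(generate(stub_model))
# The stub wants to stop after 3 tokens, but budget forcing pads
# the output with "Wait" until min_tokens is reached.
```

The point of the sketch: lengthening outputs is a sampling-time intervention on when EOS is accepted, which is consistent with the idea that only a tiny parameter change is needed to shift that behavior.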
Let's say we have an expert low-level programmer and we try to teach him algebra. Either we:
- (SFT): give him an algebra book with new nomenclature, definitions, and syntax
- (RL): let him learn algebra using C syntax

Even some advanced math usually involves applying patterns found elsewhere to new topics.
[0]: cartesien.io or Salesforce's WebscaleRL
For some use cases it can match performance at 1/20th the cost, or even exceed it at 1/10th the cost. The trade-off is ofc narrow applicability.
*At least up to 300B parameters, based on the models we’ve tested.
The real unlock isn’t TinyLoRA, it’s what this implies: ultra-cheap, continuous adaptation. The bottleneck shifts from compute to having a good reward signal.