But my real concern is with the results. The "13 parameters" figure looks like bait, because it is one result from finetuning a model on a very simple math benchmark, grade-school math (GSM8K), which is already heavily saturated for every model. Besides, it seems to happen only for the Qwen family of models... It looks like GSM8K was part of Qwen's training set, and this TinyLoRA finetuning just made the final adjustments to perfectly reflect that overtraining.
>In particular, learning to generate longer outputs may be possible in few parameters
Reminded me of: https://arxiv.org/abs/2501.19393
>we develop budget forcing to control test-time compute by forcefully terminating the model’s thinking process or lengthening it by appending “Wait” multiple times to the model’s generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps
Maybe the model indeed simply learns to emit the EOS token (or similar) later, and the capability is already present in the base model.
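A toy sketch of that budget-forcing idea: intercept EOS before a minimum "thinking" budget is reached and substitute a "Wait" token so generation continues. Everything here (the stub model, token strings, function names) is hypothetical for illustration; a real implementation would do this inside the sampler's logits processing.

```python
# Toy illustration of budget forcing: if the model tries to emit EOS
# before a minimum token budget, replace EOS with "Wait" and continue.
EOS, WAIT = "<eos>", "Wait"

def stub_model(tokens):
    # Hypothetical stand-in for an LLM: it tries to stop
    # as soon as it has generated 3 tokens.
    return EOS if len(tokens) >= 3 else f"tok{len(tokens)}"

def generate(model, min_tokens=6, max_tokens=10):
    out = []
    while len(out) < max_tokens:
        nxt = model(out)
        if nxt == EOS:
            if len(out) >= min_tokens:
                break  # budget satisfied, allow the model to stop
            nxt = WAIT  # force continued "thinking"
        out.append(nxt)
    return out

print(generate(stub_model))
# The stub wants to stop after 3 tokens, but budget forcing pads
# the output with "Wait" until min_tokens is reached.
```

The point of the sketch: lengthening outputs is a sampling-time intervention on when EOS is accepted, which is consistent with the idea that only a tiny parameter change is needed to shift that behavior.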
Let's say we have an expert low-level programmer and we try to teach him algebra. Either we:
- (SFT): give him an algebra book with new nomenclature, definitions, and syntax
- (RL): let him learn algebra using C syntax

Even some advanced math usually involves applying patterns found elsewhere to new topics.
[0]: cartesien.io or Salesforce's WebscaleRL
For some use cases it can match performance at 1/20th the cost, or even exceed it at 1/10th the cost. The trade-off is ofc narrow applicability.
*At least up to 300B parameters, based on the models we’ve tested.
The real unlock isn’t TinyLoRA, it’s what this implies: ultra-cheap, continuous adaptation. The bottleneck shifts from compute to having a good reward signal.