I’m interested in learning more about your theory that these models can be trained more cheaply. Is anyone doing it from scratch, rather than adversarial distillation?
reply
It's a lot cheaper to train a 27B model such as Qwen3.6 (one you can even vibe code or do agentic coding with) than a 1T+ parameter model. It runs on a single commodity GPU, for goodness' sake.
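As a rough sanity check on that gap, here's a back-of-the-envelope sketch using the common ~6 x params x tokens approximation for training FLOPs. The token counts are my own assumptions, not anyone's published numbers:

    # Rough training cost: FLOPs ~= 6 * parameters * training tokens
    def train_flops(params, tokens):
        return 6 * params * tokens

    small = train_flops(27e9, 2e12)   # 27B model on ~2T tokens (assumed)
    large = train_flops(1e12, 10e12)  # 1T model on ~10T tokens (assumed)
    print(f"27B run: {small:.2e} FLOPs")
    print(f"1T  run: {large:.2e} FLOPs")
    print(f"ratio:   {large / small:.0f}x")

Even with generous assumptions for the small model, the big run comes out roughly two orders of magnitude more expensive in raw compute.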

It's not a theory. These smaller models coming out now represent a huge advance for the field.

I can't comment on companies' training practices; that would be proprietary, I'd guess. But I think the claim that these advances are due to distillation alone is completely unfair. The gains aren't just a matter of data.
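For what it's worth, "distillation" here usually means the classic soft-label setup: the student is trained to match the teacher's softened output distribution alongside the ordinary hard labels. A minimal sketch (PyTorch; the function name and hyperparameters are illustrative, not anyone's actual recipe):

    import torch
    import torch.nn.functional as F

    def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        # Soft targets: student mimics the teacher's temperature-softened distribution.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Hard targets: ordinary cross-entropy on the ground-truth labels.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard

The T*T factor keeps the soft-target gradients on the same scale as the hard-target ones as the temperature changes.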

reply
It almost doesn't matter if it's trained using adversarial distillation: if it's nearly as good at one-hundredth the cost, the choice is obvious.
reply