I’m interested in learning more about your theory that these models can be trained more cheaply. Is anyone doing it from scratch, rather than adversarial distillation?
reply
It's a lot cheaper to train a 27B model such as Qwen3.6 (one you can even vibe code or do agentic coding with) than a 1T+ parameter model. It runs on a single commodity GPU, for goodness' sake.
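As a rough sanity check on that gap, here's a back-of-the-envelope sketch using the common ~6 x params x tokens approximation for training FLOPs. The token counts are my own assumptions, not anyone's published numbers:

    # Rough training cost: FLOPs ~= 6 * parameters * training tokens
    def train_flops(params, tokens):
        return 6 * params * tokens

    small = train_flops(27e9, 2e12)   # 27B model on ~2T tokens (assumed)
    large = train_flops(1e12, 10e12)  # 1T model on ~10T tokens (assumed)
    print(f"27B run: {small:.2e} FLOPs")
    print(f"1T  run: {large:.2e} FLOPs")
    print(f"ratio:   {large / small:.0f}x")

Even with generous assumptions for the small model, the big run comes out roughly two orders of magnitude more expensive in raw compute.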

It's not a theory. These smaller models coming out now represent a huge advance for the field.

I can't comment on companies' training practices; that would be proprietary, I'd guess. But I think the claim that these advances are due to distillation alone is completely unfair. The gains aren't just a matter of data.
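For what it's worth, "distillation" here usually means the classic soft-label setup: the student is trained to match the teacher's softened output distribution alongside the ordinary hard labels. A minimal sketch (PyTorch; the function name and hyperparameters are illustrative, not anyone's actual recipe):

    import torch
    import torch.nn.functional as F

    def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        # Soft targets: student mimics the teacher's temperature-softened distribution.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        # Hard targets: ordinary cross-entropy on the ground-truth labels.
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard

The T*T factor keeps the soft-target gradients on the same scale as the hard-target ones as the temperature changes.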

reply
It almost doesn't matter if it's trained using adversarial distillation: if it's nearly as good at one-hundredth the cost, the choice is obvious.
reply