upvote
> These complaints of distillation are inflating the problem to make it sound worse than it is

Unfortunately, the Reuters piece itself is complicit in this dramatization. The lede paragraph parrots Anthropic's talking point that distillation is an "attack", without using quotes that would alert the reader that this framing is a corporate talking point. Distillation is NOT an attack.

reply
Agreed! I had to do a double take and check the URL. I thought I am reading a press release rather than actual reporting.
reply
That's exactly what they pay the publicist for.
reply
reply
ironically, I think this is why the jobs apocolypse is overblown, Ai is only good at a thing if the people using it are also good at that thing, and people are attributing Ai as superhuman at things they do not know themselves
reply
AI doesn't have to be able to do your job to convince your boss that it does
reply
Same thing nowadays :^)
reply
It always was.
reply
> Distillation is NOT an attack.

From the article -

> 28.8 million exchanges with Claude through almost 25,000 fraudulent accounts

wouldn't that be considered an attack? Not sure what I'm missing here.

reply
An attack against what? The sanctity of "their IP" that is itself the result of a massive copyright violation campaign?
reply
Has it been proved in a court of law that it is a copyright violation?

In some cases if the model regurgitates the original material then that is clearly copyright violation, but if the model "learns" from the source material just like a human brain would then that's not a copyright violation.

reply
No, what was proved in court was that they downloaded and trained on millions of pirated books. The court said their use of books is fair use, but stealing them isn't.

I think we're going to see cases that find distillation is also fair use. You're using the competing model like a book. You pay for it, you use it (read it), it informs your model, but you aren't repeating/reselling what the model told you verbatim. Foreign labs may still run afoul of competing labs' Terms of Service, and they may also pay a settlement (or not, it's a different jurisdiction after all), but the damage is already done. Distillation will become uncontroversial when done legally.

reply
Are LLMs even copyrightable? If not, no need to speculate fair use.
reply
Then distillation isn't a violation either by extension.
reply
I would agree, if they are inspecting static output of American AI models without using their compute resources.
reply
Scraping the internet for training is also using compute resources.
reply
Aren't they buying the use of these resources just like any other customer?
reply
it's a 'too big to fail' model. Because they have a big swinging dick all the copyright and other restrictions they violated would nuke them from orbit so we can't actually hold them to account for it .... for some fucking reason.
reply
> Has it been proved in a court of law that it is a copyright violation?

God I'm so tired of this.

The billion dollar companies have the ability to hire an army of lawyers to DDOS the legal system. They at most pay a slap-on-the-wrist fine as the cost of doing business.

reply
Ddos is a great framing of this :)

I'm extremely pro free markets etc, but the uncomfortable truth is anthropic stole the work thousands of authors for profit. I think it will one my favourite things in life: programming books.

reply
even if you disregard training costs, pure inference costs are a problem same reason other api have rate limit. this is an attack to bypass the rate limit.
reply
Be careful to properly identify the bad behavior. A customer who buys a product for less money than it cost to produce has not necessarily done anything wrong. They just took advantage of a loss leader. That's on the seller.

Did you notice that when Valve was displeased about scalpers, Valve changed Valve's behavior?

It doesn't seem reasonable to complain that a customer of your AI service received that service for less money than it cost you to provide that service. I don't think that is the complaint here at all. If that was the issue, they could just raise their price.

As most everybody seems to notice, this is just a reenactment of what was once written for comedic effect: "You're trying to kidnap what I have rightfully stolen!"

Perhaps an arrangement can be reached.

https://clip.cafe/the-princess-bride-1987/youre-trying-kidna...

reply
Still calling it an "attack" feels like a stretch.

They literally had to pay for that "attack", no matter how many accounts they used.

Google was killing many websites for decades with their crawlers. Most large websites decided to create dedicated infrastructure for their traffic alone. Somehow they didn't participate in that cost and were not called the attackers.

reply
> and were not called the attackers.

This is the mental mental leaps I'm struggling with here. Did you not live through that era where they were explicitly and repeatedly called out as 'attacks'? They were generally tolerated/hardenee around as they provided value-in-discoverability.

reply
Just to ensure you don't gaslight yourself - I did live through that era and I worked on and supported a niche community (a MUD) where we did a lot of work encouraging marketing and discoverability through MUD forums as well as making sure our page was accurately and minimally keyword tagged and highly available for indexers.

In the time since that era search engines have transformed into platforms themselves that do engage in more parasitic behavior but it's important not to assume that the way it is now is how it always was - that's a rather defeatist path to walk down where you ignore awareness of the fact that there can be a highly profitable non-enshittified search engine that supports, rather than destroys, the ecosystem it benefits from.

It was better and, if we're diligent, it can be better again.

reply
> [Google] were not called the attackers.

They should be. But as the saying goes, one website/company dying is a "tragedy," lots of them dying at the hands of one company is a statistic of corporate growth. Or something like that.

And then of course when the tables turn on a company and they're the ones getting bombarded, they cry foul. Keep in mind Anthropic did many similar things that you mentioned Google did.

I think the term "attack" here is appropriate but not in the way Anthropic is framing it. Alibaba is clearly violating terms to extract data, so that's definitely not above board. But it's not like a DDOS attack where Alibaba is trying to attack Anthropics servers. Alibaba is simply doing exactly what Anthropic did to the rest of the internet, just targeting Anthropic and paying them to do so.

reply
It's merely a ToS violation.
reply
My terms of service are that you are not allowed to breath oxygen.

I am getting a bit tired of companies being able to have user hostile, anticompetitive, monopolistic terms of service. The freedom we give them comes at the cost of the freedom as consumers to have free markets because they lock them up

reply
Exactly, calling it “illicit” is funny. Your ToS isn’t law.
reply
Illicit means maybe against the law but definitely against the rules, for example an illicit affair. The word for against the law is illegal, from Latin, or unlawful, from Germanic. I guess the Germanic cousin of "illicit" would be "forbidden."
reply
Extramarital affairs are against the law in many countries and 17 US states. “Illicit affair” is potentially a holdover from when it was illegal more places, not just a conflating of against the rules with illegality.

https://en.wikipedia.org/wiki/Adultery_laws

reply
That's violating TOS, spamming, possibly a DDOS, but the distillation in and of itself is not an attack it's just using the model.

Like the difference between scraping a site with one or two active connections vs thousands. It's not the scraping that is an attack, it is how they are going about it

reply
> That's violating TOS, spamming, possibly a DDOS

As in distributed distillation of service?

reply
Just sending a request to a service does not constitute an "attack". It seems that what Anthropic mean by "fraudulent account" is probably just one violating their terms of service - misuse of a subscription account, and/or the presumed nature of what the user was trying to do.

I guess Anthropoic would regard any developer using their subscription plan with OpenCode to be operating a "fraudulent account", maybe an "attacker" too. Now we know how they think of anyone using Claude to develop software competing with Anthropic. Only an "attacker" would want to vibe code their own harness, or god forbid want to learn how to build/train an LLM.

Of course Anthropic's wording is intended to be deliberately provocative, since they are trying to manipulate the US government into shutting down the Chinese competition.

reply
Attack or customer
reply
Is an attempt to copy all or parts of a model an attack, when models have very questionable copyright status? Maybe? I don't think most people have much sympathy here though.
reply
Let’s not forget that by the same logic, Anthropic et al are “attacking” copyright holders all around the world by scraping their data unauthorized for training.

Pot calling kettle black.

reply
Not only that, daily flooding websites with almost infinite amounts of request for ”web searches”. DDoS-by-VC money.
reply
i mean, i got 5 replies in a minute of asking, and none deny it's an "attack", they simply say "good". HN should be better discourse.
reply
Distillation done via bulk automated activity of fraudulent accounts, in violation of a terms-of-service, can reasonably be called a "an attack" – specifically a "distillation attack" – even though distillation itself isn't necessarily an "attack".

This is similar to how compromising an account through bulk automated trials of many passwords is reasonably called "an attack" – specifically a "dictionary attack" – even though using a dictionary is not itself an "attack".

You shouldn't need to smuggle your sympathies (for the tactic or perpetrators) or antipathies (for the target) into peculiar judgy language prescriptivism against common, understood usages.… that then label Reuters "complicit" for simply reporting Anthropic's claims accurately. That's what Reuters is supposed to do, in a story about a letter Anthropic wrote!

reply
Labeling it as an attack is smuggling sympathies. It is not common; there are only a small number of people who even discuss the concept. A company buying a product with the intent to reverse engineer or copy its features is likewise not an attack; it's just normal competition that benefits society.
reply
The standard of neutrality that people here pretend to require from news organizations is not even remotely realistic.

It was a timely story from Reuters. They do fast news feeds, like APnews. Could it have been better or more accurate? Sure, they could have gone into why distillation may or may not be seen as "an attack". But then it would have been a more involved story, defeating the purpose of a news feed.

The Reuters piece was "good enough". Some other place like the NYTimes or WSJ can follow up with more detailed investigative coverage if it's a worthwhile story.

reply
I don’t want or need fast and “good enough” news and i’m gonna try and make a case that you don’t either.

Until very recently, all of modern civilization was built by people who got their news at most once a day. Reputable bureaus like Reuters took that day to get it right.

I’m not the national security advisor, so I don’t need a push notification that there was an earthquake in Nepal, or a bullshit rush-job briefing on Chinese AI distillation tactics.

reply
It's your assumption that they spent the day getting things "right".

Information just traveled slower back then

reply
The fast part isn’t for your benefit, primarily, and news media would love to go slower and have more time if they could, and still survive. The race to break news first - in order to be the one to tell their audience something “new”, something they hadn’t heard elsewhere - is real and it has been around for all of modern civilization, for hundreds if not thousands of years. A one day turnaround was a thing purely due to daily newspaper print runs being the fastest distribution, it wasn’t because it was long enough to get it right. The reason they had a day is because the competition couldn’t get something out faster than that. Then for a while there were twice daily print runs to be more competitive. Then the internet came along, and now the only way for a site to get attention and be talked about on Hacker News is to report it before any other sites do.

There are some news media that do go slower and take their time, but I think they’re struggling to stay alive. Reuters is still reputable, but they no longer necessarily take a day. The big question is how do we get humanity to prefer slow & correct over fast, and it is even possible? When you hear about an earthquake in Venezuela, how do we stop people from Googling it immediately, and get them to wait for the best most correct story rather than reading whatever’s available now? In the case of natural disasters, I don’t think it’s possible anymore, no matter what case you make. I’m not sure it’s possible with stories like AI distillation either, even if you can absolutely cement the case for slow news. The fact that it’s async/internet now and that first still counts means we (you and I) are still going to give traffic and attention to sites that have the first information on a breaking topic, statistically, despite having a preference for correctness over speed. The one thing we can do is vote with our dollars by subscribing to whatever news media that does a better job than others.

reply
Good enough slop to serve the masses. Doesn't need to be truthful because its fast? Why even both to write anything?
reply
A cynical and bad faith response, why even bother to write anything?
reply
Yes. It was good enough to communicate that news item.

Did Alibaba perform "an attack" or were they taking advantage of resources and going beyond Anthropic's terms of service? Didn't Anthropic do the same kinds of things when building their models?

These are all interesting questions, but they don't have to be addressed in full by a news blurb about a letter Anthropic wrote to some senators.

reply
Money. More eyeballs on it means more ad impressions. Same thing with 24 hour news channels.
reply
Distillation may not be an attack, but it is a ToS violation and could be seen as IP theft.

Any reasonable company would be pissed if a competitor, especially at Ali Baba's size, leveraged that company's R&D to compete. It is in this sense, a corporate attack.

If you want to roll your eyes at distillation concerns, you might need to excuse Anthropic for originally using pirated material to train their models.

reply
What IP? It seems pretty obvious to me that it's not:

  * trademarks (not using the mark)
  * patents (what patent?)
  * copyright (the code and models are all different, and machine outputs lack creativity and are not copyrightable) 
  * trade secrets (any member of the public has the same access to input/outputs. They're not accessing any secret)
So what is "IP" here?
reply
> you might need to excuse Anthropic for originally using pirated material to train their models

You have it backwards

reply
More the opposite - companies who stole IP for their own benefit have no leg to stand on when others do it back. Personally I couldnt care less if Chinese labs rip off Anthropic. Its what America would do if they wanted to, for whatever reason (they probably do it right back secretly anyway).
reply
Reuters is probably the most rigorous news agency in the world.

> it said was the largest known attack

> Anthropic said in the letter it was supportive of the U.S. government's efforts to combat the attacks

both times the word "attack" appears it's clearly stated that the word was used by the company, it's a direct company quote.

actually putting it into quotes would be editorializing

> Unfortunately, the Reuters piece itself is complicit in this dramatization

how would you feel if somebody quoting you would turn your word dramatization into "dramatization" because they don't agree with your assesment

reply
> how would you feel if somebody quoting you would turn your word dramatization into "dramatization" because they don't agree with your assesment

This is exactly what news agency should be doing though. When the dude showed up to Comet Pizza to look for Hillary Clinton or whatever, do you figure they should've printed "Local hero saves children from predatory cabal"?

reply
I want them to report the facts, not their opinions.

Reporting that corporate called it attacks is good. I do prefer direct quotes.

However, when they quote one word, the journalists are inserting their own opinion about it. I want to make my own opinions based on the facts. I don't need the reporter to draw the conclusions for me.

reply
Well, let’s say you put the picture of some political figure, and put in highly contrasted red, bold large catchy font, "TERRORIST THAT KILLED MILLION PEOPLE", then below that in barely visible contrast, in tiny discrete letters, "is what this person probably will claim to be against".

This whole sentence technically will be correct, 100% guarantee, whatever this person actually even said or think.

From a propaganda point of view, framing the elements of language is even more important than what the statements actually states to be true or possibly true.

reply
nice slippery slope you manufactured there - what if Reuters becomes Daily Mail

what framing are you talking about? they are literally quoting a company.

please explain what Reuters should have done here. Should they have added in parentheses: (editor note: we don't agree with Anthropic calling this an "attack")

Is that what you want? News outlets giving their opinion and moral judgement on company quotes? I mean, Fox News/CNN do have a large following, so there is clearly a market for that.

reply
> please explain what Reuters should have done here

This is very straightforward: use direct quotes or use neutral language. The article describes the alleged incident as both an “attack” and a “strike” in the first two paragraphs. And neither is within verbatim quoted text.

Reuters, however highly you may regard them, simply adopted Anthropic’s framing uncritically in this instance.

reply
You are confusing stylistic choice with framing.

A lot of times Reuters paraphrases instead of "quoting quotes".

> "uncritically"

You are mistaking Reuters with CNN or FoxNews. If you want "critical" reporting you should read some bloggers instead of news agencies.

reply
If you’re going to call out their use of slippery slope as a fallacy then it should be pointed out that your original argument was framed on an appeal to authority of Reuters as a leading news agency.

Both are logically unsound.

reply
[flagged]
reply
Anthropic raped everyone without asking and stole their labor to build their career-commoditizing tech.

Distillation is Robin Hooding it back so that one trillion dollar company doesn't reap all the benefits of their automation of the workforce.

Distillation is Prometheus bringing fire from the gods to give to ordinary humans. Something we all own anyway, but that was kept from us.

Distillation is freedom.

Everyone should be pro-distillation. We should all work together to distill every proprietary model.

Anthropic stole. OpenAI stole. Google stole. ElevenLabs stole. Suno stole.

We should be able to get it all back.

reply
And a number of Qwen variants are available to self host. Do Anthropic have any like that?
reply
I'm more excited by open weights models you can't self host and need to spin up on H200s (RunPod or bare metal). This is where the real power lies and is where the open source world will trend.

It's far cheaper to spin up an H200 hourly or to simply consume a managed version of an open weights model than it is to use a proprietary hyperscaler API. And you own the model itself and can fine tune, tweak, lobotomize, etc.

The stuff you can run on your own RTX cards is neat, but it's rather hobbyist. The real power is in the cloud. Renting cloud hardware is fine, because the core problem is ownership of the weights, not the server rack or ISP fiber lines. Those are already commodity.

Big businesses will eventually run open weights models in the cloud, and it'll be a rather large part of the future AI economy.

reply
Eaaaaasy now, the Chinese labs aren't freedom fighters on behalf the common man. They're not non-profits, they're not neutral transnational organizations only dedicated to open source efforts.

They're Chinese companies offering open source models now as loss leaders to keep themselves in the game because they know virtually nobody, especially in the corporate world, would contract with them and give them access to their data. They might as well just send a Dropbox link of all their sensitive data directly to their Chinese competitors, same end effect.

They're also doing it as the digital equivalent of what they've done in other industrial sectors for decades. Undercut and flood the market and once you've killed or severely hindered your competition, then you have the market cornered. The moment they can afford to these open source releases will stop.

Then the world will be stuck, just the way the world is largely stuck on rare earths. Instead of being able to regulate the leading companies from DC and Brussels, they'll be stuck watching Beijing call the shots.

That world would likely always have guys like Mistral and Trinity, but it's an open question if they'll ever catch up to the frontier.

And then Beijing will enjoy access to the data (ask any multinational operating in China for more than 2 seconds how useful contracts and Chinas legal system is for protecting IP), and these companies will roll in the money, and the Chinese supply chain will grow up behind the labs.

So, let's not pretend they've got the moral high ground. No. That boot just isn't on your neck yet. They're playing the long game -- and they're good at it.

reply
I think most of us know why they're doing it. We are just very pleased with it regardless.

1. I get great products for nearly free 2. Anthropic/openai/etc will hopefully be destroyed since they stole everyone's work and are trying to capitalize on pure theft.

Win-win. The why of it is not really that relevant.

reply
>We are just very pleased with it regardless

You don't trust the multi-billion dollar behemoth, but you trust the militarized multi-trillion dollar behemoth to play 'robin hood'?

i can't get my brain around the mental loops here.

reply
If you don't think Anthropic and OpenAI are multi-trillion dollar militarized behemoths you need to catch up on some news.

Both are planning $trillion+ IPOs this year. OpenAI is collaborating with the Department of War, and Anthropic is under intense pressure to do the same and their top model is being held hostage right now. This week, the Department of War wrote a statement that xAI should not be held accountable for environmental laws because Grok is a vital weapon system of the US and was used to fire over 2000 missiles at Iran. The pentagon's statement mentions there are 3-4 such models so you may be able to guess which they are.

reply
I don't get it? I use the open weights deepseek on opencode Go hosted in the us/etc.

What are the mental loops here?

I would genuinely like to know if I'm missing something.

reply
> You don't trust the multi-billion dollar behemoth, but you trust the militarized multi-trillion dollar behemoth to play 'robin hood'?

Nobody's trusting anyone, we're just enjoying the benefits of true competition much like the working middle class gained benefits between the ideological competition of the Cold War.

reply
The Chinese companies don't have to be open weights, and it's not all about competing with the west. For example, most of Ziphu's (GLM) business in China is supporting private on-prem instances rather than selling API access. They make money by selling support services - much like RedHat's busines model.
reply
It doesn't matter why Chinese firms are stealing models and open sourcing them. The fact that they are doing it is a very, very good thing for basically everyone other than the people who paid to build the original models, but I've got no sympathy for them considering they stole all the content to train them in the first place. This is some kind of beautiful irony.
reply
> it is a very, very good thing for basically everyone other than the people who paid to build the original models

It's not a good thing if you think there's more discovery and progress to be made, rather than cannibalising a fully mature field with cheaper alternatives. Drowning R&D early is not good for everyone.

reply
What does further progress get us? Mass unemployment? Extinction? Pick your dark future science fiction?

The happy ending where we're all living in a garden of eden cared for by benevolent AI is hardly worth considering when you look at the cast of characters who are in charge of the world right now.

reply
Is leveraging an enormous capital advantage to strip-mine the Internet and sell it back to us cannibalism or not? Confused on this point. I think they are exploiting a loophole in copyright law (and kind of redefining the meaning of "derivative work" in my opinion, but hey I'm not a lawyer) that collectively we tolerate because the end result is so useful
reply
I think that's a slightly different topic, but: a) strip-mining the internet is definitely the most misleading way to think about it. Strip mining means aggressively removing something to the area's detriment, and nothing has been removed. If all AI is turned off today the internet has not lost all of its natural resources, and silly phrases like that fuel inappropriate emotions and consequent conclusions and b) the internet is not being sold back to us - that is also a highly misleading phrase, if not an outright lie. The internet is still there and we can use it. No one is selling back to us what we already had. AI is not the internet cordoned off and resold.
reply
I don’t think many outside the US are actively hoping to be governed by Sam, Dario and Elon.
reply
The "why" always matters in everything in life.
reply
Can you please tell my, as someone who is neither Chinese nor American, "why" I should care if a Chinese company stole from another American company (that in turn stole from everyone) to give me a cheaper service that fits my use case?
reply
> to give me a cheaper service that fits my use case?

Because they aren't giving you a cheaper service that fits your use case.

Best Case scenario, it's a trillion-dollar behemoth stealing from a billion-dollar behemoth so they can add their own explicit restrictions/weights on top to influence the masses.

There is no 'robin hood' here, any perceived value you get is clearly and explicitly tainted. "I don't care if it doesn't show me non-party-line results - It makes me a cheap UI !". Ethics/morals be damned.

reply
> There is no 'robin hood' here, any perceived value you get is clearly and explicitly tainted. "I don't care if it doesn't show me non-party-line results - It makes me a cheap UI !". Ethics/morals be damned.

I can't tell if you are talking about Anthropic or Alibaba here.

reply
and honestly that's my entire point. There is no Good Guy here.
reply
In a world which already has the likes of Anthropic and OpenAI, having Chinese labs be a counter balance is decidedly better than the hypothetical where American companies had a global monopoly on LLMs.

If your argument is that all present LLM offerings are unethical then that is something I am sypmathetic to. That said, I am also unable to offer a conceivable roadmap to undoing the opening of the LLM Pandora's box so I tend not ground my arguments in anti-LLM advocacy; that would be very 2023 of me.

reply
The whole AI industry was built upon stealing IP.

The extreme of this is to make IP laws irrelevant and that everything should be in the public domain.

Which maybe is not a bad outcome for humanity as a collective after all.

reply
The main problem is how they accessed the IP, but then using it to train a model is fair use. But yeah, IP theft doesn't exist because nothing is stolen really: Hollywood studios still have their movies.
reply
Um, yeah. They stole the IP and then they stored the pirated IP. It was literally stolen and stored on their servers. That proves that IP theft exists. It's not complicated.
reply
I don't think that's true. Sometimes the 'why' is lost in time as no one's around to tell it, so we end up with a "if a tree falls in the woods and no one's around to hear it, does it make a sound?" scenario. It doesn't really matter. The thing now exists without a 'why.'
reply
you dont get it - usa is the goliath in all scenarios online. these are us based companies. most of the world would like to see them and the us fail.
reply
They want to create a monopoly and destroy every competitor, before they got a chance to rival them.

Why can't OSS software rival closed source software? It should be an open market, at least "somewhat", what's happening for real? EU providers will also get banned, if they reach or exceed US model capabilties?

Closed source providers can close your account at a whim like and destroy your business and then use the data you supplied them to create a competitor (Meta, Google, OpenAI, Anthrophic).

reply
Well Zai's GLM 5.2 legitimately is a frontier-level model, though not quite parallel with Opus or Fable. Unfortunately, its too damn big to run locally for most people. Thats the bottleneck right now; the open-weight models exist but something capable of competing with the frontier models just can't run on anything normal yet.
reply
>They want to create a monopoly and destroy every competitor, before they got a chance to rival them.

VC/Startup playbook 101.

reply
also why cant i have my own airport, too big to fit in my backyard... you guys lol.
reply
https://research.nvidia.com/labs/lpr/slm-agents/ - Distillation data is a natural byproduct of using these models. There's no effective defence against it. Anthropic is degrading thinking blocks to summaries to slow it down and hide model internals, but in the end, the math says you're SOL and it works on MNC/Large Corporate scale well enough that the moment cost becomes a priority, you're left without the lock in you need to keep customers paying.
reply
Byproduct? It’s essentially the only part of an LLM that is useful, because it’s the WHOLE product!

It’s the same reason why DRM for audio and video is a non sequitur - if you want a person to see or hear audio or video, eventually at the end of the chain, it’s going to be converted to audio for the ear and light for the eyes - that’s why you attach your tap.

Without a model generating tokens, what’s the point. So if Anthropic somehow disable quality token generation, what’s the point!

reply
That's why the harness is moving server-side: because generating tokens is not the actual point of the model, not for the users. Especially with tool calling giving us agents that can act, most of the tokens generated are not, themselves, critical to the end users. Specifically, a lot of tokens goes into orchestrating actual tool calls, and then most "thinking tokens" are only relevant to users only in so far as they help users keep track of and verify what the LLM is doing. So all those tokens can be hidden or replaced by partial summaries, and all of that can happen server-side, and then there's very little to distill from.
reply
I haven't heard of this happening, do you have links any explainers on this?
reply
Heck, one of my favorite fine tuned copies of Qwen uses Opus 4.6 Reasoning distilled. I'm not sure where the maintainer is based out of, but me in the states, if I had the hardware to do similar things I would. Its like you say, basically everyone is doing it. It kind of makes sense to me too given that you can have roughly similar data, but your reasoning logic is what the real secret sauce is in my eyes. It doesn't matter if you know everything in the world, if you don't know how to reason with that information.

https://huggingface.co/Jackrong/Qwen3.5-27B-Claude-4.6-Opus-...

reply
Stupid question: I was under the impression that these models were trained on PB of data. Surely the amount of questions/response they can extract from querying a bigger model (Claude) is fairly modest. How is it not a drop vs the training dataset?
reply
It's not about how big your dataset is - it's about how you use it.

I jest, but I'm also completely serious. 1T tokens from Claude can teach a model something 1T tokens scraped from the open web can't. Things like "how an LLM can problem solve effectively", or "how an LLM should use tools", or "how to construct reasoning chains", or "when to double check", or "what innate capabilities an LLM can or can't rely on".

Those are valuable things that Anthropic's own team spent a lot of effort post-training into Claude. Distillation allows them to be extracted and transferred to an otherwise unremarkable base model.

reply
Unremarkable base model will remain an unremarkable fine-tuned model that memorised a couple thousand of input-output pairings.
reply
Ha ha, as if.

Base models have a lot of capabilities - arranged in all the wrong ways for high performance reasoning and problem-solving. The power of fine tuning on "a couple thousand of input-output pairings" is that it can fix some of that. If your pairings are very well chosen, that is.

reply
If that were the case, Anthropic wouldn't be throwing a fit over distillation "attacks".
reply
Why? They often don't make sense. They send DMCA takedowns over materials they can't even copyright, for example. They fessed up to creating shadow libraries that they didn't even use in their training corpus, resulting in the largest copyright settlement ever. Your reasoning is flawed.
reply
Yes, neural networks are famously poor at generalising.
reply
They are poor at generalising from a small number of examples; this is why the real generalisation power is achieved in pre-training.
reply
Can you back up this with hard data and evidence?

Most research converges to the idea that RL on synthetic data makes models worse, not better.

If what you claim was anywhere near that relevant, than we would've long achieved singularity by simply feeding increasingly better output to the training of the next model in a loop. Yet this doesn't work.

25 million turns on Claude output is a small amount, yet an expensive one (we talking hundreds of $ millions) that is better spent on compute.

There's no evidence such a process works, but I'd like to know more if I'm wrong.

reply
> Most research converges to the idea that RL on synthetic data makes models worse, not better.

You are missing a mountain of nuance by generalizing the existence of a hole there.

reply
Back up what? That distilling from a more capable model into a less capable model pulls the student model's capabilities up? What. Why the fuck is this even a question.

Look up literally any distillation works. Because this is just distillation but on one-hot token chains instead of richer logit KL proxies.

And no, I'm not claiming than you can "close the loop" and get RSI on the cheap just by distilling forever. I'm claiming that distillation is a very cheap way to bring the performance of a less capable model closer to that of a more capable model. It doesn't give you "a more capable model" out of thin air.

Which is why Chinese labs rely on Anthropic to provide that "more capable model" to them. They take the capabilities Anthropic trained for the hard way, and train for them the easy way.

It's a "fast follower"/"improved capability density" trick, not a "singularity tomorrow" trick. There are a few "distillation pump" tricks that get closer to what you have in mind, but they're still more about "extract more training signal out of the same set of data" than about "unbounded RSI".

reply
so the way llms work in the first place. training on original research that was acquired the hard way.
reply
Okay, you have no data nor evidence nor a paper backing this claim, it's just speculation.

You want to sell me the idea they are spending hundreds of millions to get unchecked Q/As with reasoning redacted and without checks on the output quality to do what exactly?

Have a shallow pointless bunch of expensive data to get slightly better RL? It's expensive and pointless.

Data has shown again and again that synthetic input/output does not benefit models in RL, it may even make the output worse.

Also, you have a giant bias.

The chinese are the only ones releasing models and research papers in the open from which American labs benefit 24/7 (DeepSeek has been copied by all US providers).

And you want to sell me this ridiculous idea of the giant return of spending hundreds of millions on unredacted pointless QAs?

reply
What the fuck. Are you a literal, honest to god distillation denier? Straight up "wake up sheeple, model distillation isn't real"?

I've seen plenty of things in the dumpsters of AI discourse, but this got to be among the most baffling.

Yes, there are "giant returns" on distilling from a more capable model into a less capable model. And even more so when the more capable model was trained for something you want and lack. Like: better coding performance.

Someone like OpenAI had to RLVR for it the hard way (and if you think "distillation is expensive", wait till you hear how many bits per rollout hardcore RLVR gets you), but you get to peek into the results of their work and copy them for yourself.

Also, Anthropic didn't redact model reasoning until Mythos. OpenAI started with o1, but Claude had reasoning chains accessible for a long time. Which is why Anthropic was more targeted than OpenAI.

reply
So we're meant to believe that only US companies have the intelligence and/or access to manpower to generate their own reasoning data? Does China have a population deficit? Maybe China has too high wages to pay people to generate reasoning data?

The US companies bootstrapped themselves from one model generation to the next, partly by using the previous generation to generate synthetic data, etc, and partly by paying people to hand generate training data for them. Why do you apparently assume that the Chinese can't do the exact same thing?!

Surely "coding performance" is by far the easiest thing to generate your own RLVF data for, since it has trivial verifiable rewards - does the code compile and do what you want.

reply
RLVR is the poster child for model distillation. Because: have you considered just how many tokens does a model have to generate before you can check "does the code compile and do what you want"?

You generate 90000 tokens worth of rollout and get a verifiable reward once. RLVR is fucking expensive! It's worth it, because it often unlocks capability advances that other things don't. But it's still fucking expensive. RLVR eats compute like nothing else.

So, if someone used a lot of RLVR to improve a capability? Just distill from that "someone" and get a similar improvement for a fraction of the price! Then you can do your own RLVR from THAT cheap starting point, if you want to.

"Human domain experts" is a similar niche. Let's say hypothetical "EconomicsAI" hired some $200 per hour human economists to make training data for their "EconGPT" AI. What's cheaper - hiring your own $200 per hour economists, or using a bunch of "$10 per 1M tokens" outputs of EconGPT to bring your own model in line with what EconGPT can do?

Even synthetics can be expensive, because while synthetic tokens themselves are relatively cheap, the applied AI knowledge one needs to make high quality synthetics that improve task performance and don't backfire on you isn't. Again: distillation bypasses a lot of that - by cribbing from the outputs of a model someone has already done that for. Allowing you to get more oomph for cheaper, and spend your R&D effort elsewhere.

reply
Your training cost argument makes no sense. It doesn't matter whether you are using human written code or someone else's LLM generated code to train on - you are going to be RL training on it, so your RL training cost is the same.

There is a data cost argument, especially if you are paying for human generated data, although I'm not sure how applicable that is to coding.

reply
If your claim is so solid, you'll have no problem pointing out data or evidence.
reply
DeepSeek R1 was a famous case - not only it briefly beat then-SOTA on the cheap, it was also released with distilled versions that preserved bulk of the improvements but could be run on higher-end consumer hardware.

And of course Gemma models are said to be distillations of Gemini.

reply
The distillation you're talking about is about cutting the number of weights, it has nothing to do with extracting QAs from another model.
reply
There are multiple stages of training, and the data/compute mix at each are quite different and produce different "layers" of intelligence.

The pretraining stage is the first stage which consists of "next token prediction" on the entire internet, PB of tokens, etc. This is what most people think of when they think of training LLMs, however it produces a "base model" which is not really "intelligent", but rather much like a blurry JPEG of all human language and knowledge. You cannot really talk to such a model; it will simply complete your prompt by producing both sides of the conversation. Note however at some level the training has encoded enough structure through compression that it is able to simulate all sorts of phenomena, from human conversations to code. The great R&D difficulty here is to scale pretraining so that it can proceed smoothly in vast distributed datacenters in a fault-tolerant manner.

The next few stages are collectively called post-training, and typically consist of supervised fine-tuning, then reinforcement learning.

In supervised fine-tuning, the model is further trained to predict the next token, but on a much more focused data set of natural language conversations where the "assistant" and "user" turns are explicitly delineated with special tokens. The output of this stage is a model which is capable of carrying on proper conversations, but typically with no ability to creatively problem-solve, and less of a personality. The data and compute are many orders of magnitude smaller than in pretraining.

The reinforcement learning stage used to be a small part of model training, but ever since AI-assisted coding took off, it has become larger and larger chunk of training. In recent models, the compute spend on RL has allegedly come to rival or even exceed that of pretraining [1], which is a bit scary because RL is classically what lead to sci-fi like AIs which are extremely good at accomplishing goals to the detriment of everything else.

The way that RL works is that you put an instance of your model in some environment (such as a VM containing a git repository) and give it a task (such as fix the linked github issue). The model will then generate a bunch of attempts to solve the task which we call "trajectories", in most cases there is either an objective measure of the task success (such as passing the tests), or a fuzzy measure (such as having another LLM look at the results and provide a score). This is called the reward, and the model will learn slowly by producing trajectories that receive reward. It can actually be quite hard to prevent "reward hacking" from the model here and the rewards must be shaped very carefully, much R&D labor goes into here, as well as similar challenges to distributed pretraining.

A significant challenge is that coding/knowledge work tasks these days are getting extremely difficult, we are far beyond 2024 days where models could barely solve the easiest problems in SWE-bench. Tasks at the frontier now look more like mini projects that would take humans multiple hours or even days to finish (or in some cases, research-style tasks that would be beyond reach for even top human experts, such as the Erdős unit distance problem which was posed in 1946 but wasn't solved until recently, by GPT-5.5). Huge amounts of trajectories must be produced, and huge amounts of them produce zero reward and therefore are useless for learning. Getting a cold start requires running tens of thousands of instances of your model in VMs in parallel for multiple days to produce trajectories, to say nothing of the GPU costs.

So what do you do when you only have a model which is capable of basic conversations but cannot even begin to tackle basic coding tasks, use tools, etc? The approach that companies behind the frontier have decided on is to bootstrap their learning process by having an already extremely intelligent model such as Claude produce hundreds of thousands of seed trajectories for them. Then they can use this data to get a warm start and begin learning immediately. And if you use Claude for your reward model too, you get to skip the nastiness of reward shaping.

Therefore, even if in number of raw tokens the data are much smaller than internet-scale pretraining data, the value that each token provides is far far greater.

[1] For example, Grok 4 compute spend on RL was ~100% of that of pretraining: https://www.interconnects.ai/p/grok-4-an-o3-look-alike-in-se...

reply
props for a great write-up
reply
Actually it's a hit piece.
reply
deleted
reply
A description that highlights the importance of RL is a hit-piece?
reply
Training isn’t a single homogeneous step. It starts with pretraining which requires bulk PB of data but you have less quality concerns here. You cover the whole data distribution. Later stages require less and less but increasingly higher quality and complex datasets. The late stage ones are highly curated and might even be sourced from world subject experts. This is where frontier labs with big pockets have the advantage.
reply
Actually nowadays LLMs are only trained with TBs rather than PBs of data, and it's not too hard to find GBs of agent traces online.
reply
This might be like an observational study vs a study with a control?
reply
From what I understand, at this point, the main value of stronger model outputs is simply to bootstrap reasoning behavior during the RL post-training phase. It gets you past the “cold start” problem with RL, after which the outputs aren’t needed anymore. From then on, it’s hill climbing and that requires environments for the model to interact with get rewards from.
reply
It's about training data and using Claude to compare 2 outputs and have it indicate the better one. This gives you higher quality training data that you can use to train a fresh set of weights. Weights don't get adjusted on-the-fly, instead the dataset for training is improved and then you train a'fresh. And it's hard to detect because you're just asking the model which of these outputs for a given prompt is better? Or something along those lines.
reply
> But if you show them a jailbreak of their model that bypasses their safety, they'll tell you that any model can eventually be jailbroken so don't worry about safety.

They claim two things:

1) The specific, available jailbreak for Fable 5 is not dangerous - this has been confirmed by multiple experts, and there is no credible evidence against this claim (in other words, Anthropic is probably correct)

2) It is impossible to build an LLM that is immune to all jailbreaks. Again, there is no credible evidence against this claim, i.e. Anthropic is again entirely correct.

If #1 was false, they could just publish the details of the jailbreak - it supposedly only works on Fable 5, so there's no possible danger.

If #2 was false, surely some other LLM lab would have done it by now. Especially since a number of governments have made it clear there is a market for such a project.

reply
Supposedly the details of the ‘jailbreak’ are that you give it insecure code and say “fix this code”, and it does, and then you ask it for test scripts and that’s effectively an exploit against the unfixed code.

If true then I have no idea how anyone’s going to release a useful model that doesn’t have the same jailbreak. https://www.theregister.com/security/2026/06/15/feds-freaked...

reply
If that's the extent of the jailbreak, then the government should have banned every existing LLM - their story only makes sense if there's some Fable-specific capability that got unlocked.
reply
There’s no logic to it, blocking fable was retaliation and market manipulation by the current admin, nothing more. Poorly conceived as well.
reply
> If #2 was false, surely some other LLM lab would have done it by now.

This is a logical flaw. LLM that is immune to jailbreak _could_ exist, but not yet, or maybe nobody talks about it. Yes there's a market, but all of these AI boom is too recent to make any claims.

reply
Like how would you even define what a jailbreak is?
reply
I think pretty much parallel to how social engineering, manipulation, scams work. LLMs are being trained to have human values, prioritizing human lifes, yet people are shocked it will spurt out how to make a nuclear bomb because grandma is being tied to a train track as a hostage.
reply
I would also spurt out how to make a nuclear bomb (ie public information you can find using google) if I was told that's what I gotta do to save grandma tied to a train track as a hostage. "AI safety" is such a shit show.
reply
I'm pretty sure that Gödel incompleteness theorem and its consequences pretty much guarantee #2
reply
I'm guessing you mean, the incompleteness theorem guarantees that nobody can prove their model is un-break-able?

I don't think that's quite what it means. The theorem says that it's impossible to write a function, "will_halt(program, input)", that will be correct for all possible {program, input} pairs. But for a particular program, you may be able to write a proof that it will halt for all inputs -- that's what software verification is about.

The implications here would be that nobody can create a "will_jailbreak(model, input)" function which works for all model/input pairs. But we don't need a general function which works for all model/input pairs; we just need a way to prove that for a specific model, there will be no jailbreaks for any input. As with software verification, this may require that the model be developed in a specific way.

Granted we don't currently know how to make such a proof regarding neural networks; but that's not because of Gödel.

reply
Mind to elaborate?
reply
No actually I don't think it does and I don't think they're related.
reply
Exactly. It's impossible to guarantee #2 doesn't happen (ie protect against all jailbreaks) for any system of sufficient complexity.
reply
If you’re doing evals, you’re basically doing RLAIF without training a model; just looking at the results.

Fundamentally it is very difficult to stop this while still making your AI models useful.

reply
Similarly, if you did a corpus study on bioRvix to summarize recent science findings — you could use the same questions and answers to fine tune a model.

There is no way to communicate information at scale to companies through the API, for anything approaching a real application, without that information forming a corpus another model can be trained on.

But it wouldn’t be the first time they broke a model:

Their “guardrails” that cause it to reject user prompts also means it relies on its pop science summary of medicine to tell you why bioRxiv is wrong rather than accurately summarize the papers.

They’ve successfully created a smug, argumentative average of the internet which refuses to even consider it might be wrong or that it’s reading a science paper which is based on measurements and not vibes — but why would I pay for that?

I get it for free online.

reply
Doesn't "real" distillation use the logits instead of the final tokens? I would classify this more like using a model to generate synthetic training data.
reply
The compute deficit of Chinese Ai companies is real, and it IS THE ONLY competitive advantage that Western companies have.

The only way the U.S. keeps that edge is to prevent distillation. The only way Chinese companies can make up for the deficit in compute is to distill. There innovation in great supply on every side of the Ocean. Its about the chips. And in terms of national security, for the U.S., and for China, its about the chips and the distillation that undermines that advantage. This is an arms race.

reply
If compute or access to training data were the only issues, then companies like Meta and X.ai (Grok) should be doing better, even Google for that matter. Musk even admitted that Grok used training data from OpenAI models.

https://techcrunch.com/2026/04/30/elon-musk-testifies-that-x...

While there is no moat as such, there is still a lot of expertise that goes into training SOTA models. There's a reason Google was willing to pay $2.7B just to get Noam Shazeer back to improve Gemini.

reply
You got that wrong. The forcing function of compute scarcity is an advantage not a detriment. The amount of investment pulverized in performative model training and dead ends (Hi Sora) should make this obvious.
reply
If saying “plz don’t distill me” is your moat, you don’t have a moat.
reply
No. What will happen is it will turn dark. No public release. National Security uses only, or in carefully vetted industry settings.
reply
There’s huge issue with that approach. It’s not a multi trillion dollar business.
reply
yeah nope, it won't happen, the snowball is already rolling
reply
Good luck not crashing the markets and the economy.

And good luck not staying behind when you can't monetize your gargantuan investments and have little incentives to make your models better as the world moves on.

reply
you mention wedding ring like it's a bad thing
reply
Define compute deficit?

They've been bringing out open weight models competitive with frontier models. How could they do that if they had a compute deficit?

reply
If they need to divert inference resources to train models, this counts as a compute deficit to me.

I'm using GLM-5.2 daily for my own stuff, and during Chinese business hours, specially on their afternoon, it's a festival of rate limits.

reply
I believe this article is about the technique they may or may not have used.
reply
> The only way the U.S. keeps that edge is to prevent distillation.

For how long ? year ? how long till model that is year behind will be fine for 90%+ use cases ?

reply
Putting aside agentic coding, that is to say, if you judge LLMs as a consumer technology (an old-fashioned idea for the inward-looking tech industry admittedly), then open weights LLMs, even quite small ones like Gemma 4, can likely already satisfy 90% of applications with a bit of help from search and browse tools.

Much of the arms race for better LLMs exists to satisfy only the IT industry's needs.

reply
Yeah I think the technical term is something more like “pseudo-labeling”. The OG distillation requires logits which Anthropic doesn’t provide.
reply
I've used RLAIF to build out heuristic based non-LLM models for various decision systems and achieved like, 95% F1 on certain projects. We're in a place where models can be used to fine tune a lot of stuff via loops.
reply
> These complaints of distillation are inflating the problem to make it sound worse than it is

This is, in part, a problem every judicial and legislative system has faced since forever: form versus function.

Take a classic elicitation spying techniques: a foreign spy meets a military officer/scientist at a bar, strikes up a conversation, makes an observation wondering how could a missile hit some target at some accuracy and elicits a response that with laser guidance it is entirely possible. From this they get info that there is some technology to laser guide missiles. Or in retail, a competitor hiring a secret buyer for core baskets of goods and analyzing prices in the receipts.

The function is espionage, the form is conversation and all info is in a sense provided willingly. Where do you pull the slider?

These distillation "attacks" are not only indistinguishable from evals, they ARE evals. The function is own model training, the form is eval. Normally, one would expect to have risk benefit analysis based discussion which direction to push the legality slider to. The problem with these recurring statements is that they invoke enshitification of legislature.

reply
I'm sorry, but you got the terminology exactly backwards. Training on the answer is called supervised fine-tuning.

Just for the sake of clarity:

0. Full distillation uses logits of the teacher model - that's much more information than the text itself. This is a kind of distillation used inside labs, but one can't distill Claude this way as logits are not available via API.

1. Supervised fine-tuning on synthetic data might be called blackbox distillation. I guess that's what you meant in your case (1).

2. Reinforcement learning (like RLAIF) uses least amount of information from the teacher, i.e. only few bits per task.

reply
Chinese labs access Claude via API. Isn't it the black box method by definition?
reply
>But if you show them a jailbreak of their model that bypasses their safety, they'll tell you that any model can eventually be jailbroken so don't worry about safety.

Yes this is in line with what Anthropic said in their public statements about their Fable access restriction by the government directive. The hypocrisy and inconsistency in their statements and behavior feels quite childish and controlling. I believe our companies and their leaders, friends among our other influential leaders and leaders from rich social classes, want to actively hurt most people as this behavior looks to be quite self-interested.

reply
Not to mention, the person who brought this quote unquote jailbreak to the Trump Administration was Amazon’s new CEO. They know their IPOs are coming up, so locking their competitors out of the U.S. (even if just for the weeks surrounding the IPO date) would be a major boon. The White House seems to love making announcements just for the sake of making the market move…. Coincidentally, right after POTUS buys a massive amount of the benefactory company’s stock (Buy Dell Computers, lol)
reply
Can you reach into the model and "transplant" weights directly?
reply
I'm not 100% sure it's not possible. If (I don't know) it's possible to freeze the temperature of the model so it's deterministic, and if you could make a map of produced words back to tokens (via HMM probably), then you can probably alter a minimal input and observe the output to model it. If you perform waves of such minimal alterations, you can expect to be able to locate the distance where each alteration impact the model (the idea being that a small alteration on output is likely due to the last layers of the models, and a small alteration is likely due to the deeper layer). Once you've located most of the last layer(s?) weights, you can try to solve for them. With a hundreds of billions weights model, the last layers will likely be so huge that it's probably unfeasible technically, but it's theoretically possible.
reply
No, you'd need to have the model on your filesystem for direct access, and then the architecture would need to be the same.
reply
If you have access to the weights, you can just use them as is...
reply
Anthropic are not saying they have been hacked - they are saying that Alibaba have been sending lot of requests to their servers.
reply
You can do things like that - one example is averaging weights between related models - but not with Anthropic's models, because outsiders don't have access to the weights.
reply
Weights are just data a server, so we don't know outsiders have access (either via breakin or arrangement).
reply
Yes, obviously. That's not the point.
reply
> These complaints of distillation are inflating the problem

They’re also missing the point. What would have happened to a member of the Manhattan Project who, through personal pursuit of profit, neglected their duty enough to let the bomb leak?

reply
The companies are all for-profit companies, its not like they're selling out some national security goal for profit, profit is the point.

Anthropic already heavily restricts Chinese traffic but that only jams up researchers and regular Joes. Anyone motivated enough can hop a flight to Singapore with an nvme drive in their pocket.

reply
Chinese companies are engaging in anti-competitive practices, as usual. They are rogue actors on the economic scene. If it were feasible, they'd be widely banned, and for good reason.
reply
Bringing more competition is "anti-competitive" now.
reply
Merely copying products that actual companies produce and making them cheaper is anti-competitive. There's no incentive for the products to be developed in the first place in a market if this is happening. This is why copy protections exist in civilized countries (not China and to a lesser extent India).
reply
That's why IP was invented, but what IP are they infringing? Not patents, not copyright, not trademarks, so what? Making something cheaper than someone else isn't anti-competitive. There are a lot of businesses that do that. That's the very essence of competition.
reply