I would agree if it weren't for the fact that extracting that volume of data from a properly secured corporate network should be hard. It should raise some flags if such a high volume of data is downloaded to a user's local machine from the training or production environments.
reply
I have no proof one way or the other whether Anthropic or OpenAI have "properly secured corporate networks". Both seem like fast-changing places with lots of servers and workers. It seems most likely to me that someone somewhere made a mistake or missed something amid all the change, and that their network is not 100% secure.

But even if their networks are secure, I think that spies who are willing to coerce people, trick people and go in person to data centers or offices could find a way to get those models and other things.

reply
Aren't the models also distributed to various data centers? I think it's very easy with resources.
reply
It depends how secure they are. But yes, in reality they are only a couple of TB, so just distributing the models and their source code (not their training data) is feasible.
reply
I mean, the source for Claude Code was "leaked" by accident, so at least some of their processes are not that secure. I feel that they are more like a startup than an enterprise (ignoring finances).
reply
There are sooooo many exfil methods, including ones for air-gapped systems that are off-network.

Not at all beyond the capabilities of any of the top ~9 or so state actors.

Edit: To answer your question, very easily on the 20TB.

One crude method in particular works well with a simple device: clone the monitor data and pass it through over HDMI. Then just cat the directory in encrypted chunks to something like a USB key connected to the passthrough. 4TB USB keys are out there. A week of that gets you 20TB.
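
For what it's worth, here is a rough back-of-the-envelope check on the "20TB in a week" figure. It is only a sketch: it assumes decimal terabytes, a transfer that runs continuously for all seven days, and no overhead from encryption or swapping drives. The function names are made up for illustration.

```python
import math

TB = 10**12  # assumption: decimal terabytes (the thread does not say TB vs TiB)


def sustained_rate_mb_s(total_bytes, days):
    """Average rate in MB/s needed to move total_bytes in the given number of days."""
    seconds = days * 24 * 60 * 60
    return total_bytes / seconds / 10**6


def drives_needed(total_bytes, drive_bytes):
    """Number of drives of size drive_bytes needed to hold total_bytes."""
    return math.ceil(total_bytes / drive_bytes)


rate = sustained_rate_mb_s(20 * TB, 7)
drives = drives_needed(20 * TB, 4 * TB)
print(f"{rate:.1f} MB/s sustained, {drives} drives of 4 TB")
```

Under those assumptions it comes out to roughly 33 MB/s sustained around the clock, spread over five 4TB keys.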

reply
How many of those methods can realistically exfiltrate 20TB of data? That's quite hard even for well-funded actors.
reply
It's highly unlikely that actors have access to model weights etc.

What is likely is that 'understanding of techniques' could be leaked.

Often, it's enough just to know 'the approach' being used.

reply